Control Systems Safety Evaluation and Reliability
Third Edition
William M. Goble
ISBN 978-1-934394-80-9
Goble, William M.
Control systems safety evaluation and reliability / William M. Goble.
-- 3rd ed.
p. cm. -- (ISA resources for measurement and control series)
Includes bibliographical references and index.
ISBN 978-1-934394-80-9 (pbk.)
1. Automatic control--Reliability. I. Title.
TJ213.95.G62 2010
629.8--dc22
2010015760
This book has been made possible only with the help of many other persons.
Early in the process, J. V. Bukowski of Villanova taught a graduate course in
reliability engineering where I was introduced to the science. This course and
several subsequent tutorial sessions over the years provided the help necessary to
get started.
Many others have helped develop the issues important to control system safety
and reliability. I want to thank co-workers: John Grebe, John Cusimano, Ted Bell,
Ted Tucker, Griff Francis, Dave Johnson, Glenn Bilane, Jim Kinney, and Steve
Duff. They have asked penetrating questions, argued key points, made
suggestions, and provided solutions to complicated problems. A former boss,
Bob Adams, deserves a special thank-you for asking tough questions and
demanding that reliability be made a prime consideration in the design of new
products.
Fellow members of the ISA84 standards committee have also helped develop the
issues. I wish to thank Vic Maggioli, Dimitrios Karydos, Tony Frederickson, Paris
Stavrianidis, Paul Gruhn, Aarnout Brombacher, Ad Hamer, Rolf Spiker, Dan
Sniezek and Steve Smith. I have learned from our debates.
Finally, I wish to thank my wife Sandy and my daughters Tyree and Emily for their
patience and help. Everyone helped proofread, type, and check math. While the
specific help was greatly appreciated, it is the encouragement and support for
which I am truly thankful.
PREFACE
Chapter 1  INTRODUCTION
    Control System Safety and Reliability
    Standards
    Exercises
    Answers to Exercises
    References
This general approach is very consistent with that of those who work to
economically optimize their designs. Design constraints must be balanced in
order to provide the optimal design. The ultimate economic success of the
process is affected by all of the design constraints. True design optimization
requires that alternative designs be evaluated in the context of the
constraints. Numeric targets, and methods to quantitatively evaluate safety and
reliability, are the tools needed to include this dimension in the optimization
process.
Failure rate data, the primary input required for most methods, is not
precisely specified or readily available. Precise failure rate data requires an
extensive life test where operational conditions match expected usage.
Several factors prevent this testing. First, current control system components
from quality vendors have achieved a general level of reliability that allows
them to operate for many, many years. Precise life testing requires that units
be operated until failure. The time required for this testing is far beyond the
usefulness of the data (components are obsolete before the test is complete).
Second, operational conditions vary significantly between control system
installations. One site may have failure rates that are much higher than those
of another site. Last, variations in usage will affect the reliability of a
component. This is especially true when design faults exist in a product.
Design faults are probable in the complex components used in today's systems.
Design faults, "bugs," are almost expected in complicated software.
Software reliability has been the subject of intense research for over a
decade. These efforts are beginning to show some results. This is important to
the subject of control systems because of the explosive growth of software
within these systems. Although software engineering techniques have provided
better design fault avoidance methods, the growth has outstripped the
improvements. Software reliability may well be the control system reliability
crisis of the future.
Safety and reliability are important design constraints for control systems.
When those involved in system design share a common vocabulary, understand the
evaluation methods, include all site variables, and understand how to evaluate
software reliability, then safety and reliability can become true design
parameters. This is the goal.
William M. Goble
Ottsville, PA
April 2010
Dr. William M. Goble has more than 30 years of experience in analog and
digital electronic circuit design, software development, engineering
management and marketing. He is currently a founding member and
Principal Partner with exida, a knowledge company focused on
automation safety and reliability.
Given the importance of safety and reliability, how are they achieved? How are
they measured? The science of reliability engineering has advanced considerably
in recent decades. That science offers a number of fundamental concepts used to
achieve high reliability and high safety. These concepts include high-strength
design, fault-tolerant design, on-line failure diagnostics, and high
common-cause strength. All of these important concepts will be developed in
later chapters of this book. When these concepts are understood and used, great
benefits can result.
Reliability Engineering
The science of reliability engineering has developed a number of qualitative
and semi-quantitative techniques that allow an engineer to understand system
operation in the presence of a component failure. These techniques include
failure modes and effects analysis (FMEA), qualitative fault tree analysis
(FTA), and hazard and operability studies (HAZOP). Other techniques, based on
probability theory and statistics, allow the control engineer to quantitatively
evaluate the reliability and safety of control system designs. Reliability
block diagrams and fault trees use combinational probability to evaluate the
system-level probability of success, probability of safe failure, or
probability of dangerous failure. Another popular technique, Markov modeling,
shows system success and failure via circles called states. These techniques
will be covered in this book.
Life-cycle cost modeling may be the most useful technique of all to answer
questions of optimal cost and justification. Using this analysis tool, the
output of a reliability analysis in the language of statistics is converted to
the clearly understood language of money. It is frequently quite surprising how
much money can be saved by using reliable and safe equipment. This is
especially true when the cost of failure is high.
Perspective
The field of reliability engineering is relatively new compared to other
engineering disciplines, with significant research having been driven by
military needs in the mid-1940s. Introductory work in hardware reliability was
done in conjunction with the German V2 rocket program, where innovations such
as the 2oo3 (two-out-of-three) voting scheme were invented [Ref. 1, 2]. Human
reliability research began with American studies done on radar operators and
gunners during World War II. Military systems were among the first to reach
complexity levels at which reliability engineering became important. Methods
were needed to answer important questions about these increasingly complex
systems.
Control systems and safety protection systems have also followed an
evolutionary path toward greater complexity. Early control systems were simple.
Push buttons and solenoid valves, sight gauges, thermometers, and dipsticks
were typical control tools. Later, single-loop pneumatic controllers dominated.
Not only were most of these machines inherently reliable, many also failed in
predictable ways. With a pneumatic system, when the air tubes leaked, the
output went down. When an air filter clogged, the output went to zero. When the
hissing noise changed, a good technician could run diagnostics just by
listening to determine where the problem was. Safety protection systems were
built from relays and sensing switches. With the addition of safety springs and
special contacts, these devices would virtually always fail with the contacts
open. Again, they were simple devices that were inherently reliable with
predictable, (mostly) fail-safe failure modes.
The inevitable need for better processes eventually pushed control systems to a
level of complexity at which sophisticated electronics became the optimal
solution for control and safety protection. Distributed microcomputer-based
controllers introduced in the mid-1970s offered economic benefits, improved
reliability, and flexibility.
These questions are best answered using quantitative reliability and safety
analysis. Markov analysis has been developed into one of the best techniques
for answering such questions, especially when time-dependent variables such as
imperfect proof testing are important. Failure Modes, Effects and Diagnostic
Analysis (FMEDA) has been developed and refined as a new tool for quantitative
measurement of diagnostic capability. These new tools and refined methods have
made it easier to optimize designs using reliability engineering.
Standards
Many new international standards have been created in the world of
reliability engineering. Standards now provide detailed methods of
determining component failure rates [Ref. 3]. Standards provide checklists
of issues that should be addressed in qualitative evaluation. Standards
define performance measures against which quantitative reliability and
safety calculations can be compared. Standards also provide explanations
and examples of how systems can be designed to maximize safety and
reliability.
[Figure: Safety integrity level ranges, relating PFDavg bands (e.g., 10^-2 to
10^-1) to risk reduction (e.g., 10 to 100).]

[Figure: Safety lifecycle. Create conceptual process design; perform hazard and
risk analysis; develop safety requirements and determine SIL; perform detail
SIS design; verify that safety SIS requirements have been met (verification of
requirements/SIL); create maintenance and operations procedures; perform
maintenance and operations; perform periodic testing; modification or
decommission.]
The controversy may also come from the experiences that gave rise to another
famous quotation, "Garbage in, garbage out." Poor failure rate estimates and
poor simplifying assumptions can ruin the results of any reliability and safety
evaluation. Good qualitative reliability engineering should be used to prevent
garbage from going into the evaluation. Qualitative engineering provides the
foundation for all quantitative work.
Quantitative safety and reliability evaluation can add great depth and
insight into the design of a system and design alternatives. Sometimes
intuition can be deceiving. After all, it was once intuitively obvious that
the world was flat. Many aspects of probability and reliability can appear
counter-intuitive. The quantitative reliability evaluation either verifies the
qualitative evaluation or adds substantially to it. Therein lies its value.
Exercises
1.1 Are methods used to determine safety integrity levels of an industrial
process presented in ANSI/ISA-84.00.01-2004 (IEC 61511 Mod)?
1.2 Are safety integrity levels defined by order-of-magnitude quantitative
numbers?
1.3 Can quantitative evaluation techniques be used to verify safety
integrity requirements?
1.4 Should quantitative techniques be used exclusively to verify safety
integrity?
Answers to Exercises
1.1 Yes, ANSI/ISA-84.00.01-2004 (IEC 61511 Mod) describes the concept of
safety integrity levels and presents example methods on how to determine the
safety integrity level of a process.
1.2 Yes, in the ISA-84.01-1996, IEC 61508 and ANSI/ISA-84.00.01-2004
(IEC 61511 Mod) standards.
1.3 Yes, if quantitative targets (typically an SIL and required reliability)
are defined as part of the safety requirements.
1.4 Not in the opinion of the author. Qualitative techniques are
required as well in order to properly understand how the system
works under failure conditions. Qualitative guidelines should be
used in addition to quantitative analysis.
References
1. Coppola, A. "Reliability Engineering of Electronic Equipment: A Historical
Perspective." IEEE Transactions on Reliability. IEEE, April 1984.
Random Variables
The concept of a random variable seems easy to understand, and yet many
questions and statements indicate misunderstanding. For example, the random
variable in reliability engineering is time to failure. A manager reads that on
average an industrial boiler explodes every fifteen years (the average time to
failure is fifteen years) and knows that the unit in his plant has been running
fourteen years. He calls a safety engineer to determine how to avoid the
explosion next year. This is clearly a misunderstanding.
The process of failure is like many other processes that have variations in
outcome that cannot be predicted by substituting variables into a formula.
Perhaps the exact formula is not understood. Or perhaps the variables
involved are not completely understood. These processes are called
random (stochastic) processes, primarily because they are not well
characterized.
Some random variables can have only certain values. These random variables are
called discrete random variables. Other variables can have a numerical value
anywhere within a range. These are called continuous random variables.
Statistics are used to gain some knowledge about these random variables and the
processes that produce them.
Statistics
Statistics are usually based on data samples. Consider the case of a researcher
who wants to understand how a computer program is being used. The researcher
calls six computer program users at each of twenty different locations and asks
what specific program function is being used at that moment. The program
functions are sorted into five numbered categories. The results of the survey
(sample data) are presented in Table 2-1. This is a list of data values.
Table 2-1. Survey Results (sites 10-20 shown); each row lists the function
category reported by each of six users at one site.

Site 10: 2 2 2 2 2 2
Site 11: 2 5 2 2 3 2
Site 12: 3 2 2 4 2 2
Site 13: 5 2 1 2 2 3
Site 14: 2 2 2 3 2 2
Site 15: 3 2 2 4 2 2
Site 16: 2 2 2 3 2 5
Site 17: 2 3 1 2 2 3
Site 18: 1 2 2 2 2 2
Site 19: 2 2 3 2 3 1
Site 20: 2 2 2 2 1 2
Histogram
One of the more common ways to organize data is the histogram. A histogram is a
graph with data values on the horizontal axis and the quantity of samples with
each value on the vertical axis. A histogram of the data from Table 2-1 is
shown in Figure 2-1.
EXAMPLE 2-1
Problem: Using the survey data of Table 2-1, what is the probability that a
randomly chosen answer falls within category five?
Solution: The histogram shows that three answers from the total of one hundred
and twenty were within category five. Therefore, the chances of getting an
answer in category five are 3/120, which is 2.5%.
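As a sketch of how such a histogram is tallied (not from the book; only the
Table 2-1 rows shown above are used, not the full 120 samples):

# Tally survey answers into a histogram, as in Figure 2-1.
from collections import Counter

samples = [
    2, 2, 2, 2, 2, 2,  # Site 10
    2, 5, 2, 2, 3, 2,  # Site 11
    3, 2, 2, 4, 2, 2,  # Site 12
    5, 2, 1, 2, 2, 3,  # Site 13
]

counts = Counter(samples)
total = len(samples)
for category in sorted(counts):
    bar = "#" * counts[category]
    print(f"category {category}: {bar} ({counts[category]}/{total})")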
A probability density function (PDF) describes the probability of getting a
random variable value within a range. The random variable values typically form
the horizontal axis, and probability numbers (a range of 0 to 1) form the
vertical axis. For a discrete random variable, the probabilities of all
outcomes sum to one:

\sum_{i=1}^{n} P(x_i) = 1    (2-2)

and for a continuous random variable, the PDF integrates to one:

\int_{-\infty}^{+\infty} f(x)\,dx = 1    (2-3)
Figure 2-2 shows a discrete PDF for the toss of a pair of fair dice. There are
36 possible combinations that add up to 11 possible outcomes. The probability
of getting a result of seven is 6/36 because there are six combinations that
result in a seven. The probability of getting a result of two is 1/36 because
there is only one combination of the dice that will give that result. Again,
the probabilities total to one.
Distributions in which all outcomes are equally likely are called uniform
distributions. The probability of getting an outcome within an interval is
proportional to the length of the interval. For example, the probability of
getting a result between 101.0 and 101.2 is 0.02 (an interval length of 0.2
times a probability density of 0.1). As the length of the interval is reduced,
the probability of getting a result within the interval drops. The probability
of getting one exact particular value is zero.
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx    (2-4)
EXAMPLE 2-2
Problem: A random variable has a uniform PDF with f(x) = 0.001 over its range.
What is the probability of getting a result between 110 and 120?
Solution: Using Equation 2-4,

P(110 \le X \le 120) = \int_{110}^{120} 0.001\,dx
                     = 0.001x \Big|_{110}^{120} = 0.12 - 0.11 = 0.01
EXAMPLE 2-3
Problem: A continuous random variable has the PDF f(t) = kt for 0 \le t \le 4,
and f(t) = 0 otherwise. What value of k makes this a valid PDF?
Solution: Using Equation 2-3,

\int_{0}^{4} kt\,dt = 1

which equals

\frac{kt^2}{2} \Big|_{0}^{4} = 1

Evaluating,

\frac{16k}{2} = 1,  so  k = 0.125
EXAMPLE 2-4
Problem: A random variable, time to failure, has the exponential PDF
f(t) = 0.01e^{-0.01t}. What is the probability of failure during the first 50
hours of operation?
Solution: Using Equation 2-4,

P(0 \le T \le 50) = \int_{t=0}^{t=50} 0.01\,e^{-0.01t}\,dt

Substituting u = 0.01t (so du = 0.01\,dt), the integral becomes

\int_{u=0}^{u=0.5} 0.01\,e^{-u}\,\frac{du}{0.01} = \left[-e^{-u}\right]_{0}^{0.5}

Evaluating,

= \left[-e^{-0.01(50)}\right] - \left[-e^{-0.01(0)}\right]
= -0.60653 + 1 = 0.39347
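A one-line numerical check of Example 2-4 (a sketch, not from the book): the
exponential integral has the closed form F(t) = 1 - e^{-kt}, so

import math

# F(50) = 1 - e^(-0.01 * 50) for the exponential CDF with k = 0.01
print(1.0 - math.exp(-0.01 * 50.0))  # 0.39346934...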
The cumulative distribution function (CDF), F, gives the probability that the
random variable is less than or equal to a given value. For a discrete random
variable,

F(x_n) = \sum_{i=1}^{n} P(x_i)    (2-5)

For a continuous random variable,

F(x) = \int_{-\infty}^{x} f(x)\,dx    (2-6)

Every CDF satisfies F(+\infty) = 1, F(-\infty) = 0, and dF/dx \ge 0. This means
that the area under the PDF curve between a and b is simply the CDF value at b
minus the CDF value at a.
EXAMPLE 2-5
Problem: Calculate and graph the discrete CDF for a dice toss process where the
PDF is as shown in Figure 2-2.
Solution: The CDF is obtained by summing the PDF values (Equation 2-5) for all
outcomes up to each value; the result is plotted in Figure 2-4.

[Figure 2-4. Dice Toss Cumulative Distribution Function]
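A small sketch (not from the book) that builds the dice-toss PDF and
accumulates it into the CDF of Figure 2-4:

from fractions import Fraction
from itertools import product

# PDF: count the 36 equally likely dice combinations for each sum.
pdf = {}
for a, b in product(range(1, 7), repeat=2):
    pdf[a + b] = pdf.get(a + b, Fraction(0)) + Fraction(1, 36)

# CDF: running sum of the PDF (Equation 2-5).
running = Fraction(0)
for outcome in sorted(pdf):
    running += pdf[outcome]
    print(outcome, pdf[outcome], running)  # e.g., outcome 7: PDF 1/6, CDF 7/12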
EXAMPLE 2-6
Problem: Calculate and graph the CDF for a uniform PDF that equals 0.1 between
x = 100 and x = 110, and zero elsewhere.
Solution: During the interval from 100 to 110, the PDF equals a constant, 0.1.
The PDF equals zero for other values of x. Therefore, using Equation 2-6,

F(x) = \int_{100}^{x} 0.1\,dx = 0.1(x - 100),  for 100 \le x \le 110

The CDF is zero for values of x less than 100. The CDF is one for values of x
greater than 110. This CDF is plotted in Figure 2-5.
Mean
The average value of a random variable is called the mean or expected
value of a distribution. For discrete random variables, the equation for
mean, E, is
E(x) = \bar{x} = \sum_{i=1}^{n} x_i P(x_i)    (2-8)
EXAMPLE 2-7
Problem: What is the mean value of the dice toss distribution of Figure 2-2?
Solution: Using Equation 2-8,

E(x) = 2(1/36) + 3(2/36) + 4(3/36) + 5(4/36) + 6(5/36) + 7(6/36)
     + 8(5/36) + 9(4/36) + 10(3/36) + 11(2/36) + 12(1/36)
     = 252/36 = 7
For a set of n data samples, the sample mean is

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}    (2-9)
EXAMPLE 2-8
Problem: A pair of fair dice is tossed ten times. The results are 7, 2,
4, 6, 10, 12, 3, 5, 4, and 2. What is the sample mean?
Solution: Using Equation 2-9, the samples are added and divided by
the number of samples. The answer is 5.5. Compare this to the
answer of seven obtained in Example 2-7. The difference of 1.5 is
large but not unreasonable since our sample size of ten is small.
EXAMPLE 2-9
Problem: The dice tossing of Example 2-8 continues until one hundred tosses
have been made. What is the sample mean of the one hundred samples?
Solution: Using Equation 2-9, the samples are added and divided by the number
of samples. The total of all one hundred samples (including those from Example
2-8) is 725. Dividing this by the number of samples yields a sample mean result
of 7.25. This result is closer to the distribution mean obtained in Example
2-7.
For continuous random variables, the mean is

E(x) = \int_{-\infty}^{+\infty} x f(x)\,dx    (2-10)
EXAMPLE 2-10
Problem: What is the mean of the random variable with the PDF of Example 2-3,
f(t) = t/8 for 0 \le t \le 4?
Solution: Using Equation 2-10,

E(t) = \int_{0}^{4} t \cdot \frac{t}{8}\,dt

Integrating,

E(t) = \frac{1}{8} \cdot \frac{t^3}{3} \Big|_{0}^{4} = \frac{64}{24} = 2.667
Median
The median value (or simply, median) of a sample data set is a measure
of center. The median value is defined as the data sample where there are
an equal number of samples of greater value and lesser value. It is the
middle value in an ordered set. If there are two middle values, it is the
value halfway between those two values.
Knowing both the mean and the median allows one to evaluate symmetry.
In a perfectly symmetric PDF, such as a normal distribution, the mean and
the median are the same. In an asymmetric PDF, such as the exponential,
the mean and the median are different.
EXAMPLE 2-11
EXAMPLE 2-12
Variance
As noted, the mean is a measure of the center of mass of a distribution.
Knowing the center is valuable, but more must be known. Another important
characteristic of a distribution is spread: the amount of variation. A process
under control will be consistent and will have little variation. A measure of
variation is important for control purposes [Ref. 1, 2].
One way to measure variation is to subtract values from the mean. When this is
done, the question "How far from the center is the data?" can be answered
mathematically. For each discrete data point, calculate

(x_i - \bar{x})    (2-11)

Squaring these deviations and weighting each by its probability gives the
variance of a discrete distribution:

\sigma^2(x_i) = \sum_{i=1}^{n} (x_i - \bar{x})^2 P(x_i)    (2-12)
where x_i refers to each discrete outcome and P(x_i) refers to the probability
of realizing each discrete outcome. If a set of sample data is given, the
formula for sample variance is

s^2(x_i) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}    (2-13)

where x_i refers to each piece of data in the set. The size of the data set is
given by n. As in the case of the sample mean, the sample variance will provide
a result similar to the actual variance as the sample size grows larger. But
there is no guarantee that these numbers will be equal. For this reason, the
sample variance is sometimes called an estimator of the variance.
For continuous random variables, the variance is

\sigma^2(x) = \int_{-\infty}^{+\infty} (x - \bar{x})^2 f(x)\,dx    (2-14)
Standard Deviation
The standard deviation is the square root of variance. The formula for standard
deviation is

\sigma(x) = \sqrt{\sigma^2(x)}    (2-15)

Standard deviation is often assigned the lowercase Greek letter sigma (σ). It
is a measure of spread just like variance and is most commonly associated with
the normal distribution.
EXAMPLE 2-13
Problem: What is the variance of the dice toss distribution of Figure 2-2?
Solution: In Example 2-7 the mean was calculated as seven. Using Equation 2-12,

\sigma^2(x_i) = (2-7)^2(1/36) + (3-7)^2(2/36) + (4-7)^2(3/36) + (5-7)^2(4/36)
              + (6-7)^2(5/36) + (7-7)^2(6/36) + (8-7)^2(5/36) + (9-7)^2(4/36)
              + (10-7)^2(3/36) + (11-7)^2(2/36) + (12-7)^2(1/36)
              = 5.834
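The dice results of Examples 2-7 and 2-13 can be checked directly from
Equations 2-8 and 2-12; a minimal sketch (not from the book):

# Two-dice distribution: outcome -> number of combinations out of 36.
combos = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}

mean = sum(x * n / 36 for x, n in combos.items())                    # Eq. 2-8
variance = sum((x - mean) ** 2 * n / 36 for x, n in combos.items())  # Eq. 2-12

print(mean)      # 7.0
print(variance)  # 5.8333...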
EXAMPLE 2-14
Common Distributions
Several well-known distributions play an important role in reliability
engineering. The exponential distribution is widely used to represent the
probability of component failure over a time interval. This is a direct result
of a constant failure rate assumption. The normal distribution is used in many
areas of science. In reliability engineering it is used to represent strength
distributions. It is also used to represent stress. The lognormal distribution
is used to model repair probabilities.
Exponential Distribution
The exponential distribution is commonly used in the field of reliability. In
its general form it is written

f(x) = k\,e^{-kx}, for x \ge 0;  f(x) = 0, for x < 0    (2-16)

The CDF is

F(x) = 1 - e^{-kx}, for x \ge 0;  F(x) = 0, for x < 0    (2-17)

These equations are valid for values of k greater than zero. The CDF will reach
the value of one at x = \infty (which is expected). Figure 2-6 shows a plot of
an exponential distribution PDF and CDF where k = 0.6. Note: PDF(x) =
d[CDF(x)]/dx.
Normal Distribution
The normal distribution is the most well known and widely used probability
distribution in general science fields. It is so well known because it applies
(or seems to apply) to so many processes. In reliability engineering, as
mentioned, it primarily applies to measurements of product strength and
external stress.

The PDF is perfectly symmetrical about the mean. The spread is measured by
variance. The larger the value of variance, the flatter the distribution.
Figure 2-7 shows a normal distribution PDF and CDF where the mean equals four
and the standard deviation equals one. Because the PDF is perfectly symmetric,
the CDF always equals one half at the mean value of the PDF. The PDF is given
by

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (2-18)

where \mu is the mean value and \sigma is the standard deviation. The CDF is
given by

F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x}
       e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx    (2-19)
The standard normal distribution is a normal distribution that has a mean value
of zero and a standard deviation of one. Tables can be generated (Appendix A
and Ref. 2, Table 1.1.12) showing the numerical values of its cumulative
distribution function for different values of z, where

z = \frac{x - \mu}{\sigma}    (2-20)

Any normal distribution with any particular \mu and \sigma can be translated
into the standard normal distribution by scaling its variable x into z using
Equation 2-20. Through the use of these techniques, numerical probabilities can
be obtained for any range of values for any normal distribution.
EXAMPLE 2-15
P(T \le 70) = 0.99865
Lognormal Distribution
The random variable X has a lognormal PDF if lnX has a normal distribu-
tion. Thus, the lognormal distribution is related to the normal distribution
and also has two parameters.
Exercises
2.1 A failure log is provided. Create a histogram of failure categories:
software failure, hardware failure, operations failure, and maintenance
failure.
FAILURE LOG - 10 sites
Symptom - Failure Category
1. PC crashed - Software
2. Fuse blown - Hardware
3. Wire shorted during repair - Maintenance
4. Program stopped during use - Software
5. Unplugged wrong module - Maintenance
6. Invalid display - Software
7. Scan time overrun - Software
8. Water in cabinet - Maintenance
9. Wrong command - Operations
10. General protection fault - Software
11. Cursor disappeared - Software
12. Coffee spilled - Hardware
13. Dust on thermocouple terminal - Hardware
14. Memory full error - Software
15. Mouse nest between circuits - Hardware
16. PLC crashed after download - Software
17. Output switch shorted - Hardware
18. LAN addresses duplicated - Maintenance
19. Computer memory chip failed - Hardware
20. Power supply failed - Hardware
21. Sensor switch shorted - Hardware
22. Valve stuck open - Hardware
23. Bypass wire left in place - Maintenance
24. PC virus causes crash - Software
25. Hard disk failure - Hardware
Outcome - Probability
1 - 0.1
2 - 0.2
3 - 0.5
4 - 0.1
5 - 0.1

What is the average failure time in days? What is the median failure time in
days?
2.7 A controller will fail when the temperature gets above 80°C. The ambient
temperature follows a normal distribution with a mean of 40°C and a standard
deviation of 10°C. What is the probability of failure?
Answers to Exercises
2.1 Software failures - 9, Hardware failures - 10, Maintenance failures -
5, Operations failures - 1. The histogram is shown in Figure 2-10.
The Venn diagram is shown in Figure 2-11.
2.2 The chances that the next failure will be a software failure are 9/25.
2.3 Figure 2-12 shows the PDF. Figure 2-13 shows the CDF.
2.4 The mean is calculated by multiplying the values by the probabilities per
Equation 2-8. Mean = 1(0.1) + 2(0.2) + 3(0.5) + 4(0.1) + 5(0.1) = 2.9.
2.5 The variance is calculated using Equation 2-12. Answer: 1.09.
2.6 Mean of the data equals 427.94 days. Average failure time in hours equals
427.94 × 24 = 10,271 hours. Median failure time in days equals (183 + 228)/2 =
205.5; in hours, 4,932 hours. Note that this assumes that failure occurred
precisely at the end of each 24-hour day. This is not accurate but is a common
assumption.
2.7 Using a standard normal distribution, z = (80 - 40)/10 = 4.00. From
Appendix A at z = 4.00, F(z) = 0.999969. The probability of failure then equals
1 - 0.999969 = 0.000031.
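Exercise 2.7 can also be checked without the Appendix A table by computing the
standard normal CDF with the error function; a sketch (not from the book):

import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(T > 80) for ambient temperature T ~ Normal(mean=40, std=10)
print(1.0 - normal_cdf(80.0, 40.0, 10.0))  # ~3.17e-05 (table gives 0.000031)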
References
1. Mamzic, C. L. Statistical Process Control. Research Triangle Park:
ISA, 1996.
Failures
A failure occurs when a system, a unit, a module, or a component fails to
perform its intended function. Control systems fail. Everyone understands that
this potentially expensive and potentially dangerous event can occur. To
prevent future control system failures, the causes of failure are studied. When
failure causes are understood, system designs can be improved. To obtain
sufficient depth of understanding, all levels of the system must be examined.
For safety and reliability analysis purposes, four levels are defined in this
book. A system is built from units. If the system is redundant, multiple units
are used. Non-redundant systems use a single unit. Units are built from
modules, and modules are built from components (see Figure 3-1). Many real
control systems are constructed in just this manner. Although system
construction could be defined in other ways, these levels are optimal for use
in safety and reliability analysis, especially the analysis of fault-tolerant
systems. These terms will be used throughout the book.
Failure Categorization
A group of control system suppliers and users once brainstormed at an
ISA standards meeting on the subject of failures. A listing of failure
sources and failure types resulted:
Humidity
Software bugs
Temperature
Power glitches
[Figure 3-1. System hierarchy: SYSTEM > UNIT > MODULE > COMPONENT]
This list is quite diverse and includes a number of different failure types, as
well as failure sources, which need to be sorted if any good understanding is
to be gained.

When thinking about all system failures, two types emerge: random failures and
systematic failures (Figure 3-2). The functional safety standard IEC 61508
[Ref. 1] defines a random failure as "Failure occurring at a random time which
results from one or more of the possible degradation mechanisms in the
hardware." It notes that "There are many degradation mechanisms occurring at
different rates in different components and, since manufacturing tolerances
cause components to fail due to these mechanisms after different times in
operation, failures of equipment comprising many components occur at
predictable rates but unpredictable (i.e., random) times."
[Figure 3-2. Failure types: random failures and systematic failures]
Other failures are called functional failures (or systematic failures). A
systematic failure occurs when no components have failed, yet the system does
not perform its intended function. IEC 61508 defines a systematic failure as a
"Failure related in a deterministic way to a certain cause, which can only be
eliminated by a modification of the design or of the manufacturing process,
operational procedures, documentation or other relevant factor."
The failure may or may not even appear repeatable, depending on the data
entered.
There is debate about how to classify some failures [Ref. 6]. One view is that
all failures are systematic because an engineer should have properly
anticipated all stress conditions and designed the system to withstand them!
Different analysts may classify the same failure as random or systematic, with
the distinction often depending on the assumptions made by the analyst. If a
component manufacturer specifies a strict requirement for clean air per an ISA
standard and a failure is caused by intermittent bad air resulting from failure
of the air filter, then the manufacturer will likely classify the failure as
systematic. The (unrealistic) assumption is that the component user can design
an air supply system that never allows dirty air. The component user, who
designed and installed the air filter system, knows that air quality can vary
randomly due to random events (e.g., air filter failure) and would likely
classify the same failure as random. Of course, more than a few of these random
failures will likely convince the component user to find a better-designed
component that can withstand random air quality failure events.
Both random and systematic failures have attributes that are important to
control system safety and reliability analysis. This failure information is
needed to help determine how to prevent future failures of both types.
Information is recorded about the failure source and the effect on the control
system's function. The term "failure stress source" is used to represent the
cause of a failure: the cause of death. This information should include all the
things that have stressed the system. The items from the earlier brainstormed
list can be classified as follows:
Item - Class
Humidity - External failure stress source
Software bugs - Internal design failure stress source
Temperature - External failure stress source
Power glitches - External failure stress source
Stuck pneumatic actuator - Failure type: random failure
Dirty air - External failure stress source
System design errors - Internal design failure stress source
Electrostatic discharge (ESD) - External failure stress source
Sticky O-ring - Failure type: random failure
Broken wires - Failure type: random failure
Corrosion-induced open circuits - Failure type: random failure
Random component failures - Failure type: random failure
Repairman error - External failure stress source
RFI - External failure stress source
Operator closing wrong switch - Failure type: systematic failure
Vibration - External failure stress source
Broken structural member - Failure type: random failure
Improper grounding - External failure stress source
Wrong configuration loaded - Failure type: systematic failure
Incorrect replacement - Failure type: systematic failure
Wrong software version installed - Failure type: systematic failure
Random Failures
Random failures may be permanent (hard error) or transient (soft error). In
some technologies a random failure is most often permanent and attributable to
some component or module. For example, a system that consists of a single-board
controller module fails. The controller output de-energizes and no longer
supplies current to a solenoid valve. The controller diagnostics identify a bad
output transistor component. A failure analysis of the output transistor shows
that it would not conduct current and failed with an open circuit. The failure
occurred because a thin bond wire inside the transistor melted. Plant
Operations reports a nearby lightning strike just before the controller module
failed. Lightning causes a surge of electrical stress that can exceed the
strength of a transistor. It should be noted that lightning is considered a
random event.
Systematic Failures
If all the physical components of a control system are working properly
and yet the system fails to perform its function, that failure is classified as
systematic. Systematic (or functional) failures have the same attributes as
physical failures. Each failure has a failure stress source and an effect on
the control function. The failure stress sources are almost always design
faults, although sometimes a maintenance error or an installation error
causes a systematic failure.
The exact source of a systematic failure can be terribly obscure. Often the
failure will occur when the system is asked to perform some unusual function,
or perhaps the system receives some combination of input data that was never
tested. Some of the most obscure failures involve combinations of stored
information, elapsed time, input data, and function performed. The ability to
resist such stress may be considered a measure of strength in the design.
Failure stress sources are either internal or external to the system (see
Figure 3-3). Internal failure stress sources typically result in decreased
strength. Internal sources include design faults (product) and manufacturing
faults (process). The faults can occur at any level: component, module, unit,
or system. External failure stress sources increase stress. They include
environmental sources, maintenance faults, and operational faults. A failure is
often due to a combination of internal weakness (decreased strength) and
external stress. Most, if not all, failures due to external stress are
classified as random. Internal failure stress sources can be classified as
either random or systematic.
Internal Design
Design faults (insufficient strength) can cause system failures. They are a
major source of systematic failures. Many case histories are recorded [Ref.
8, 9, 10]. In some cases the designer did not understand how the system
would be used. Sometimes the designer did not anticipate all possible
environmental or operational conditions. Different designers working on
different parts of the system may not have completely understood the
whole system. These faults may apply to component design, module
design, unit design, or system design.
In one famous example, the BMEWS (Ballistic Missile Early Warning System) radar
in Thule, Greenland, had detected the rising moon. The designers did not
anticipate this input. Reports indicate that this design was strengthened via a
redesign shortly after the failure occurred. This is characteristic of design
faults that are properly remedied: strength increases.

Design faults can be uncovered by review techniques, such as a walkthrough
process where a second party (not the designer) describes the design to a group
of reviewers. Other techniques involve the use of experienced personnel who
specifically look for design faults.
Internal Manufacturing
Manufacturing defects occur when some step in the process of manufacturing a
product (component, module, unit, or system) is done incorrectly. For example,
a controller module failed when it no longer supplied current to a valve
positioner. It was discovered that a resistor in the output circuit had a
corroded wire that no longer conducted electricity. This was a surprise, since
the resistor wire is normally coated to protect against corrosion. A closer
examination showed that the coating had not been applied correctly when the
resistor was made. Small voids in the coating occurred because the coating
machine had not been cleaned on schedule.
External Environmental
Control systems are used in industrial environments where many things are
present that help cause failures (Figure 3-4).

[Figure 3-4. External environmental failure sources, including temperature and
humidity]
Temperature can directly cause failure. A PC-based industrial computer module
that performs a data logging function is mounted in a ventilated shed. One hot
day, the ventilation fan fails and the temperature reaches 65°C. The power
supply overtemperature circuit shuts down the computer.
Many other industrial failure stress sources are present. Large electrical
currents are routinely switched, generating wideband electrical noise.
Mechanical processes such as mixing, loading, and stamping cause shock and
vibration. Consider an example: a controller module mounted near a turbine had
all its calibration data stored in a programmable memory mounted in an
integrated circuit socket. After several months of operation, the controller
failed. Its outputs went dead. An examination of the failed controller module
showed that the memory had vibrated out of the socket. The controller computer
was programmed to safely de-energize the outputs if calibration data was lost.

Failure rates caused by environmental sources vary with the product design.
Some products are stronger and are designed to withstand higher environmental
stress; they do not fail as frequently.
1. They can cause direct failure when their magnitude exceeds design limits.
For example, an integrated circuit mounted in a plastic package works well even
when hot (some are rated to junction temperatures of 150°C). But when the
temperature exceeds the glass transition point of the plastic (approximately
180°C), the component fails. In another example, the keyboard for a personal
computer worked well in a control room even through the humid days of summer.
But it failed immediately when a glass of water tipped into it. For this
failure:

Title: Operator console failed
Failure: Keyboard failed: water spill
Failure Type: Random
Primary Failure Stress Source: Water on keyboard
Secondary Failure Stress Source: None identified
Such environmental stress levels should be taken into account when calculating
failure rates, and used as adjustment factors based on experience. In the
safety and reliability analysis of control systems, many other types of stress
must also be considered.
Stress
Stress varies with time. Consider, for example, stress caused by ambient
temperature. Ambient temperature cycles every day as the earth rotates.
Temperature levels change with the season and other random events. When
temperature is viewed as a whole, the stress level can be characterized as
random.
[Figure 3-5. Normally Distributed Stress]
When evaluating stress levels that cause failures, the probability of getting
a stress level lower than a particular value must be determined. This is
represented by an inverse distribution function as shown in Figure 3-6.
Strength
As mentioned above, the strength of a product is a measure of the ability of
the product to resist a failure stress source (stressor).

[Figure 3-6. Probability of Getting a Stress Level Lower Than a Particular
Value]

[Figure 3-7. Strength Versus Inverse Stress CDF; the region where stress
exceeds strength is marked with a crosshatch.]
If y represents strength and x represents stress, the margin is

w = y - x    (3-1)

The product succeeds when w > 0 and fails when w \le 0. The failure probability
is represented by the area where the stress-strength curves interfere. This
area is reduced when more safety factor (the difference between mean strength
and mean stress) is used by the designer.
EXAMPLE 3-1
Solution: How much of the area under the normal curve that represents stress
(ambient temperature) exceeds the strength (90°C) of the transistor? First,
relate the transistor strength to ambient temperature. Since the transistor
operates 30 degrees hotter than ambient, it will fail when the ambient
temperature exceeds 90°C - 30°C = 60°C.
A simulation can show how many modules will fail as a function of time, given a
particular stress-strength relationship. A simulation was done [Ref. 16] using
a normally distributed stress and a constant strength. Choosing values (Figure
3-8) similar to previous work done at Eindhoven University in the Netherlands
[Ref. 15], the results are shown in Figure 3-9. Within the limits of the
simulation, a relatively constant number of modules would fail as a function of
time.
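A Monte Carlo version of that simulation fits in a few lines. This sketch is
not from the book; it uses the Figure 3-8 values (normal stress with mean 10
and standard deviation 1 against a constant strength of 12.7) and a hypothetical
population size:

import random

random.seed(1)
strength = 12.7
units = 100_000  # hypothetical starting population

for period in range(1, 11):
    # Each surviving unit sees one random stress event per time period.
    failed = sum(1 for _ in range(units) if random.gauss(10.0, 1.0) > strength)
    print(f"period {period}: failure rate = {failed / units:.5f}")
    units -= failed
# The printed rate stays near P(stress > 12.7) ~ 0.0035 in every period:
# the roughly constant failure rate seen in Figure 3-9.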
[Figure 3-8. Stress Versus Strength Simulation Values, Identical Strength:
stress is normal with mean 10 and standard deviation 1; strength is constant at
12.7.]

[Figure 3-9. Failure Rate as a Function of Operation Time Interval, Identical
Strength]
Strength Varies
Actual manufacturing processes are not ideal, and newly manufactured products
are not identical in strength. In addition to the effects of manufacturing
defects, the raw materials used to make components vary from batch to batch,
the component manufacturing process varies, and the module-level manufacturing
process can differ in any number of places. The result of this variation in
strength can be characterized by a probability density distribution.
Experience, supported by the central limit theorem, would suggest another
normal distribution.
[Figure 3-10. Stress Versus Strength Simulation Values, Normal Strength: stress
is normal with mean 10.0 and standard deviation 1.0; strength is normal with
mean 12.7 and standard deviation 0.5.]

[Figure 3-11. Failure Rate as a Function of Operating Time Interval, Normal
Strength]
Strength Decreases
Strength changes with time. Although there are rare circumstances where product
strength increases with time, a vast majority of strength factors decrease.
Even in the absence of changes in stress, as strength decreases, the likelihood
of failure increases, and the rate at which failures occur will increase with
time.

[Figure 3-12. Stress-Strength with Decreasing Strength: stress is normal with
mean 10.0 and standard deviation 1.0; starting strength is normal with mean
12.7 and standard deviation 0.5 at t = 0, with the strength distribution shifted
lower by t = 800.]
[Figure: Failure rate as a function of operating time interval with decreasing
strength; the failure rate increases with time.]
Measuring Strength
Strength is measured by testing a product (a component, a module, a unit, or a
system) until it fails (or at least until the limits of the test equipment have
been reached). When several identical devices are actually tested to
destruction during a design qualification test, a good indication of strength
is obtained. Many industrial stressors have been characterized by standard
tests.
Temperature
Temperature stress testing is defined by the IEC 60068-2 standard [Ref. 17].
Different temperature levels for operation and storage are defined. A series of
tests under various conditions are defined. The most common tests for
operational temperature stress are 68-2-1 Test Ab, 68-2-2 Test Bb, and 68-2-14
Tests Na and Nb. A typical industrial specification might read "Operating
Temperature: 0 to 60°C, tested per IEC 60068-2-2."

Although the test standards do not require operation beyond the test limits,
testing to destruction should be done during design qualification to ensure
that the measured strength has a good margin. A high-strength industrial
controller should successfully operate at least 30°C beyond the specification.
High temperature-related strength is designed into industrial electronic
modules by adding large metal heat sinks and heat sink blocks to critical
components like microprocessors and power output drivers.
Humidity
Humidity testing is also done per the IEC 60068-2 series of standards. The most
common testing is done using 68-2-3 Test Ca for operational humidity and
68-2-30 Test Dd for cycling humidity. Common specifications for industrial
devices are 5 to 95% non-condensing for operation and 0 to 100% condensing for
storage. Control modules with plastic spray coating and contact lubricant will
resist humidity to much higher stress levels.
Mechanical Shock/Vibration
Mechanical shock and vibration stressors can be quite destructive to control
systems, especially when they are installed near rotating equipment like pumps
and compressors. Testing is typically done by vibrating the product under test
at various frequencies between 10 and 150 Hz per IEC 68-2-6 Test Fc. Although
displacement and acceleration are specified for different portions of the
frequency spectrum, it is useful to think of the stressor levels in Gs. IEC
68-2-27 Test Ea defines how mechanical shocks are applied.
Corrosive Atmospheres
Failures due to corrosive atmospheres are common in control systems, especially
in the chemical, oil and gas, and paper industries. Some of the corrosive gases
attack the copper in electronic assemblies. The ISA-71.04 standard [Ref. 18]
specifies several levels of corrosion stress. Testing can be done to verify the
ability of a system to withstand corrosive atmospheres over a period of time.
Electromagnetic Fields
The IEC 61000-4 series of standards [Ref. 19] specifies levels of various
electromagnetic stressors. IEC 61000-4-3 specifies test methods for radiated
electromagnetic fields (EMF) and IEC 61000-4-6 specifies test methods for
conducted electromagnetic disturbances. Industrial control equipment is
typically specified to withstand radiated EMF (sometimes called radio frequency
interference, RFI) at levels of 10 V/m over frequencies between 80 MHz and
1000 MHz. There should be enough strength to withstand 10 V conducted
disturbances over a 150 kHz to 80 MHz range. Module shielding and EMF filtering
will add strength to a module.
Electrostatic Discharge
IEC 61000-4-2 explains how to determine strength against electrostatic
discharge. An electrostatic discharge can occur when two dissimilar materials
are rubbed together. When an operator gets out of a chair or a person removes a
coat, an electrostatic discharge may result. Equipment that will withstand a
15 kV air discharge or an 8 kV contact discharge is suited for use in an
industrial environment. High strength can be provided by grounded conductive
(metal) enclosures that provide a safe path for electrostatic discharge.
Exercises
3.1 What failure stress source has caused the most failures in your plant?
3.2 Can software design faults cause failures of computer-based controllers?
3.3 What failure stress sources are primarily responsible for infant mortality
failures?
3.4 What method can be used to cause manufacturing defects to become failures
in a newly manufactured product?
3.5 How is a wearout mechanism related to the strength of a product?
3.6 The input module to a process control system will withstand a maximum of
2000 volts transient electrical overstress without failure. Input voltage
transients for many installations over a period of time are characterized by a
normal distribution with a mean of 1500 volts and a standard deviation of 200
volts. What is the probability of module failure due to input voltage transient
electrical overstress?
3.7 A stress is characterized as having a normal distribution with a mean of 10
and a standard deviation of 1. If devices will fail when subjected to stresses
above 12.7, what is the probability of failure?
3.8 Contact lubricant is a fluid used to seal electrical connectors and provide
strength against what stressors?
Answers to Exercises
3.1 The answer depends on a particular plant site.
3.2 Yes, software design faults can cause control systems to fail. This
risk may go up as the quantity and complexity of software in our
systems increases.
3.3 Manufacturing defects.
3.4 Stress screening during manufacture, burn-in.
3.5 Wearout occurs when the strength of a device decreases with time.
3.6 Using a standard normal distribution with x = 2000, μ = 1500, and σ = 200,
z = 2.50. From Appendix A, the probability of failure = 1 - 0.993791 =
0.006209.
References
1. IEC 61508:2000, Functional Safety of Electrical/Electronic/Programmable
Electronic Safety-related Systems. Geneva: International Electrotechnical
Commission, 2000.
7. Watson, G. F. "Three Little Bits Breed a Big, Bad Bug." IEEE Spectrum. New
York: IEEE, May 1992.
17. IEC 60068-2-2 Ed. 4.0 b:1974, Environmental Testing, Part 2: Tests. Tests
B: Dry Heat. Geneva: International Electrotechnical Commission, 2000.
Reliability Definitions
As the field of reliability engineering has evolved, several measures have been
defined to express useful parameters that specifically relate to successful
operation of a device. Based on that work, additional measures have been more
recently defined that specifically relate to safety engineering. These measures
have been defined to give the different kinds of information that engineers
need to solve a number of different problems.
Time to Failure
The term "random variable" is well understood in the field of statistics. It is
the independent variable: the variable being studied. Samples of the random
variable are taken. Statistics are computed about that variable in order to
learn how to predict its future behavior.

The sample average (or mean) of the failure times can be calculated. For this
set of test data, the sample mean time to failure, MTTF, is calculated to be
3,248 hours. This measurement provides some information about the future
performance of similar modules.
Reliability
Reliability is a measure of success. It is defined as the probability that a
device will be successful; that is, that it will satisfactorily perform its
intended function when required to do so, if operated within its specified
design limits. The definition includes four important aspects of reliability,
the first of which is probability.
Unreliability
Unreliability, F(t), a measure of failure, is defined as the probability that a
device will fail during the operating time interval, t. In terms of the random
variable T,

F(t) = P(T \le t)    (4-2)

Unreliability equals the probability that the failure time will be less than or
equal to the operating time interval. Since any device must be either
successful or failed, F(t) is the one's complement of R(t).
Availability
Reliability is a measure that requires success (that is, successful operation)
for an entire time interval. No failures (and subsequent repairs) are allowed.
This measurement was not enough for engineers who needed to know the chance of
success when repairs may be made. Availability is defined as the probability
that a device is successful (operating) at any moment in time.

[Figure: Reliability, availability, unreliability, and unavailability measures]
Unavailability
Unavailability, a measure of failure, is also used for repairable devices. It
is defined as the probability that a device is not successful (is failed) at
any moment in time. Unavailability is the one's complement of availability;
therefore,

U(t) = 1 - A(t)    (4-4)
EXAMPLE 4-1
Problem: A controller has an availability of 0.99. What is the
unavailability?
Solution: Using Equation 4-4, unavailability = 1 - 0.99 = 0.01.
Probability of Failure
The probability of failure during any interval of operating time is given by a
probability density function (see Chapter 2). The probability density function
for failure probability is defined as

f(t) = \frac{dF(t)}{dt}    (4-5)

This can be interpreted as the probability that the failure time, T, will occur
between a point in time t and the next interval of operation, t + \Delta t, and
is called the probability of failure function.

The probability of failure function can provide failure probabilities for any
time interval. The probability of failure between the operating hours of 2000
and 2200, for example, is

P(2000, 2200) = \int_{2000}^{2200} f(t)\,dt    (4-7)
EXAMPLE 4-2
Problem: A device has a constant failure rate of 0.0002 failures per hour. What
is the probability that it will fail after the warranty (6 months, 4,380 hr)
and before plant shutdown (12 months, 8,760 hr)?
Solution: Using Equation 4-7 with the exponential density f(t) =
0.0002e^{-0.0002t}, this evaluates to

P(4380, 8760) = e^{-0.0002(4380)} - e^{-0.0002(8760)} = 0.4164 - 0.1734 = 0.243

This result states that the probability of failure during the interval from
4,380 hours to 8,760 hours is 24.3%.

MTTF
MTTF is merely the mean or expected failure time. It is defined from the
statistical definition of expected or mean value (Equation 2-10). Using the
random variable operating time interval, t, and recognizing that there is no
negative time, we can update the mean value equation and substitute the
probability density function f(t):

E(t) = \int_{0}^{+\infty} t\,f(t)\,dt    (4-8)
Substituting
d[R(t)]
f ( t ) = ------------------- (4-9)
dt
+
E( t) = t d[ R ( t ) ]
0
+
E(T) = [ tR ( t ) ] 0 R ( t ) dt
0
The first term equals zero at both limits. This leaves the second term,
which equals MTTF:
$$\text{MTTF} = E(T) = \int_0^{+\infty} R(t)\,dt \qquad \text{(4-10)}$$
For a constant failure rate,
$$\text{MTTF} = \frac{1}{\lambda} \quad \text{by definition} \qquad \text{(4-11)}$$
NOTE: The formula MTTF = 1/λ is valid for single components with a constant failure rate or a series of components, all with constant failure rates. See The Constant Failure Rate later in this chapter.
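Equation 4-10 can be verified numerically for the constant failure rate case. A minimal sketch, with an assumed failure rate, that integrates R(t) and compares the result to 1/λ:

    import math

    lam = 1.0e-4  # assumed constant failure rate, failures per hour

    # integrate R(t) = exp(-lam * t) far enough that the tail is negligible
    t_end, steps = 2.0e5, 200000
    dt = t_end / steps
    mttf_numeric = sum(math.exp(-lam * (i + 0.5) * dt) for i in range(steps)) * dt

    print(mttf_numeric)  # approximately 10000 hours
    print(1.0 / lam)     # 10000 hours, Equation 4-11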
MTBF
MTBF is the mean time between failures of a repairable device. It equals the sum of the mean time to failure and the mean time to restore:
$$\text{MTBF} = \text{MTTF} + \text{MTTR} \qquad \text{(4-12)}$$
[Figure 4-4. MTTF, MTTR, and MTBF: a period of successful operation (MTTF) followed by a repair (MTTR), with MTBF spanning both]
The term MTBF has been misused. Since MTTR is usually much smaller
than MTTF, MTBF is often approximately equal to MTTF. MTBF, which by
definition only applies to repairable systems, is often substituted for
MTTF, which applies to both repairable and non-repairable systems.
EXAMPLE 4-5
Problem: A repairable controller has an MTTF of 87,600 hours and an average repair time (MTTR) of 2 hours. What is the MTBF?
Solution: Using formula 4-12, the MTBF = 87,602 hours. The MTBF is effectively equal to the MTTF.
Failure Rate
Failure rate, often called hazard rate by reliability engineers, is a com-
monly used measure of reliability that gives the number of failures per
unit time from a quantity of components exposed to failure.
Failure rate has units of inverse time. It is a common practice to use units of failures per billion (10^9) hours. This unit is known as FIT for Failure unIT. For example, a particular integrated circuit will experience seven failures per billion operating hours at 25°C and thus has a failure rate of seven FITs.
Note that the measure failure rate is most commonly attributed to a sin-
gle component. Although the term can be correctly applied to a module,
unit, or even system where all components are needed to operate (called a
series system), it is a measure derived in the context of a single component.
The failure rate function is related to the other reliability functions. It can
be shown that
$$\lambda(t) = \frac{f(t)}{R(t)} \qquad \text{(4-14)}$$
Consider the failure log for a highly accelerated life test (HALT) as shown
in Table 4-2. The number of failures is decreasing during the first few
weeks. The number then remains relatively constant for many weeks.
Toward the end of the test the number begins to increase.
Failure rate is calculated in column four and equals the number of fail-
ures divided by the number of module hours (surviving modules times
hours) in each weekly period. The failure rate also decreases at first, then
remains relatively constant, and finally increases. These changes in the
failure rates of components are typically due to several factors including
variations in strength and strength degradation with time. Note that, in
such a test, the type and level of stress do not change.
Table 4-2. HALT failure log (weeks 7 through 38)
Week | Modules Operating | Failures | Failure Rate (per hr)
7 | 28 | 2 | 0.0004
8 | 26 | 1 | 0.0002
9 | 25 | 1 | 0.0002
10 | 24 | 0 | 0.0000
11 | 24 | 2 | 0.0005
12 | 22 | 1 | 0.0002
13 | 21 | 1 | 0.0003
14 | 20 | 0 | 0.0000
15 | 20 | 1 | 0.0003
16 | 19 | 0 | 0.0000
17 | 19 | 1 | 0.0003
18 | 18 | 1 | 0.0003
19 | 17 | 0 | 0.0000
20 | 17 | 1 | 0.0003
21 | 16 | 1 | 0.0004
22 | 15 | 0 | 0.0000
23 | 15 | 1 | 0.0004
24 | 14 | 0 | 0.0000
25 | 14 | 1 | 0.0004
26 | 13 | 0 | 0.0000
27 | 13 | 1 | 0.0005
28 | 12 | 0 | 0.0000
29 | 12 | 1 | 0.0005
30 | 11 | 0 | 0.0000
31 | 11 | 1 | 0.0005
32 | 10 | 1 | 0.0006
33 | 9 | 1 | 0.0007
34 | 8 | 1 | 0.0007
35 | 7 | 1 | 0.0008
36 | 6 | 1 | 0.0010
37 | 5 | 2 | 0.0024
38 | 3 | 3 | 0.0059
[Figure 4-5. Decreasing Failure Rate]
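The failure rate column of Table 4-2 can be reproduced with a short calculation. This sketch assumes 168 operating hours per module per weekly period, which matches the table values:

    week_data = [  # (modules operating, failures during the week), from Table 4-2
        (28, 2), (26, 1), (25, 1), (24, 0), (24, 2), (22, 1),
    ]
    HOURS_PER_WEEK = 168

    for modules, failures in week_data:
        # failure rate = failures / module-hours in the period
        rate = failures / (modules * HOURS_PER_WEEK)
        print(f"{modules:2d} modules, {failures} failures -> {rate:.4f} per hour")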
Components wear at different rates. Imagine a life test of 100 fan motors in
which the motor bearings wore at exactly the same rate; one day all the
motors would fail at the same instant. Because components do not wear at
the same rate, they do not fail at the same time. However, as a group of
components approach wear-out, the failure rate increases.
Bathtub Curve
As we have seen, a group of components will likely be exposed to many
kinds of environmental stress: chemical, mechanical, electrical, and physi-
cal. The strength factors as initially manufactured will vary and strength
will change at different rates as a function of time. When the failure rates
due to these different failure sources are superimposed, the well-known
bathtub curve results.
The failure rate of a group of components decreases in early life. The left
part of the curve has been called the roller coaster curve (Ref. 2, 3). The
failure rate will be relatively constant after the components containing
manufacturing defects are removed. This failure rate can be very low if the
components have few design faults and high strength. As physical
[Figure: bathtub curve, failure rate versus time, with a relatively constant failure rate during useful life]

The Constant Failure Rate
During the useful-life period of the bathtub curve, the failure rate is approximately constant: λ(t) = λ. With a constant failure rate, the reliability functions take the exponential form:
$$f(t) = \lambda e^{-\lambda t} \qquad \text{(4-15)}$$
$$F(t) = 1 - e^{-\lambda t} \qquad \text{(4-16)}$$
and
$$R(t) = e^{-\lambda t} \qquad \text{(4-17)}$$
Not every group of components shows a long period of constant failure rate; some show a decreasing failure rate. In these cases, though, the constant failure rate represents a worst-case assumption and can still be used.
Substituting the exponential reliability function into Equation 4-10,
$$\text{MTTF} = \int_0^{+\infty} R(t)\,dt = \int_0^{+\infty} e^{-\lambda t}\,dt$$
and integrating,
$$\text{MTTF} = -\frac{1}{\lambda}\left[e^{-\lambda t}\right]_0^{+\infty}$$
When the exponential is evaluated, the value at t = infinity is zero and the value at t = 0 is one. Substituting these results, we have a solution:
$$\text{MTTF} = -\frac{1}{\lambda}[0 - 1] = \frac{1}{\lambda} \qquad \text{(4-18)}$$
EXAMPLE 4-8
Problem: A motor has a constant failure rate of 150 FITs. What is the motor reliability for a mission time of 1000 hours?
Solution: 150 FITs equals 0.00000015 failures per hour. Using Equation 4-17,
$$R(1000) = e^{-0.00000015 \times 1000} = 0.99985$$

EXAMPLE 4-10
Problem: A controller has a constant failure rate of 276 FITs (0.000000276 failures per hour). What is its reliability for a one-year (8760 hour) mission?
Solution: Using Equation 4-17,
$$R(8760) = e^{-0.000000276 \times 8760} = 0.9976$$

EXAMPLE 4-11
Problem: What is the MTTF of the controller from Example 4-10?
Solution: Using Equation 4-18,
$$\text{MTTF} = \frac{1}{0.000000276} = 3{,}623{,}188 \text{ hr}$$
A Useful Approximation
Certain functions can be approximated by a series of other functions, and one of these approximations is useful in reliability engineering. For all values of x, it can be shown that
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots \qquad \text{(4-19)}$$
For small values of x, the higher-order terms are negligible and
$$e^{-x} \approx 1 - x$$
A rearrangement yields
$$1 - e^{-x} \approx x$$
Therefore,
$$P(\text{failure}) = 1 - e^{-\lambda t} \approx \lambda t \qquad \text{(4-20)}$$
when λt is small.
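The quality of the approximation in Equation 4-20 is easy to see numerically. A small sketch comparing 1 - e^(-λt) with λt for an assumed failure rate:

    import math

    lam = 1.0e-5  # assumed constant failure rate, failures per hour
    for t in (100, 1000, 10000, 100000):
        exact = 1 - math.exp(-lam * t)  # Equation 4-16
        approx = lam * t                # Equation 4-20
        print(t, round(exact, 6), round(approx, 6))
    # The approximation is excellent while lam * t is small, and it is
    # conservative (larger than the exact value) as lam * t grows.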
Restore Rate
The restore rate is the reciprocal of the average restore time:
$$\mu = \frac{1}{\text{MTTR}} \qquad \text{(4-21)}$$
EXAMPLE 4-13
Problem: A device has an average restore time (MTTR) of 4 hours. What is the restore rate?
Solution: Using Equation 4-21,
$$\mu = \frac{1}{4} = 0.25 \text{ per hour}$$
For single components with a constant failure rate and a constant restore rate, steady-state availability can be calculated (see Chapter 8 for more detail) using the formula:
$$A = \frac{\mu}{\lambda + \mu} \qquad \text{(4-22)}$$
Substituting Equations 4-18 and 4-21 gives the equivalent form:
$$A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \qquad \text{(4-23)}$$
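A quick sketch of Equations 4-22 and 4-23, using an assumed failure rate and repair time, shows that the two forms agree:

    lam = 1.0e-5     # assumed failure rate, per hour
    mttr = 8.0       # assumed mean time to restore, hours

    mu = 1.0 / mttr      # restore rate, Equation 4-21
    mttf = 1.0 / lam     # Equation 4-18

    a_rates = mu / (lam + mu)        # Equation 4-22
    a_times = mttf / (mttf + mttr)   # Equation 4-23
    print(a_rates, a_times)          # both approximately 0.99992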
Safety Terminology
When evaluating system safety an engineer must examine more than the
probability of successful operation. Failure modes of the system must also
be reviewed. The normal metrics of reliability, availability, and MTTF
only suggest a measure of success. Additional metrics to measure safety
include probability of failure on demand (PFD), average probability of
failure on demand (PFDavg), risk reduction factor (RRF), and mean time
to fail dangerously (MTTFD). Other related terms are probability of failing
safely (PFS), mean time to fail safely (MTTFS), and diagnostic coverage.
These terms are especially useful when combined with the other reliability
engineering terms.
Failure Modes
A failure mode describes the way in which a device fails. Failure modes
must be considered in systems designed for safety protection applications,
called Safety Instrumented Systems (SIS). Two failure modes are impor-
tantsafe and dangerous. ISA standard 84.00.01-2004 (IEC 61511 Mod.)
defines safe state as state of the process when safety is achieved. In
the majority of the most critical applications, designers choose a de-ener-
gized condition as the safe state. Thus a safe failure mode describes any
failure that causes the device to go to the safe state. A device designed for
these safety protection applications should de-energize its outputs to
achieve a safe state. Such a device is called normally energized.
A safe failure in such a device (Figure 4-8) happens when the output de-
energizes even though there is no potentially dangerous condition. This is
frequently called a false trip. There are many different reasons that this
can happen. Input circuits can fail in such a way that the logic solver
thinks a sensor indicates danger when it does not. The logic solver itself
can miscalculate and command the output to de-energize. Output circuits
can fail open circuit. Many of the components within an SIS can fail in a
mode that will cause the system to fail safely.
[Figure 4-7. Successful Operation, Normally Energized System: during successful operation the input switch opens and the PLC solid state output switch de-energizes the solenoid]
[Figure 4-8. Safe Failure, Normally Energized System: the input circuit fails de-energized, the logic solver reads an incorrect logic 0 (or solves the logic incorrectly, or incorrectly sends a logic 0), or the output fails de-energized, so the load de-energizes with no dangerous condition present]
There are many component failures that might cause dangerous system
failure, especially if a system is not designed for high safety. An IEC 61508
Certified PLC is specifically designed to avoid this failure mode using a
number of design techniques.
[Figure 4-9. Dangerous Failure, Normally Energized System: the pressure sense input circuit fails energized or the solid state output switch fails energized, so the load cannot be de-energized on demand]
PFS/PFD/PFDavg
There is a probability that a normally energized SIS will fail with its out-
puts de-energized. This is called probability of failure safely (PFS). There
is also a probability that the system will fail with its outputs energized.
This is called probability of failure on demand (PFD). The term refers to
the fact that when a safety protection system is failed dangerously, it will
NOT respond when a demand occurs. Figure 4-10 shows the relationship
of safe and dangerous failure modes to overall system operation.
[Figure 4-10. Relationship of PFS and PFD to reliability and availability]
$$A(t) = 1 - [\text{PFS}(t) + \text{PFD}(t)] \qquad \text{(4-25)}$$
$$\text{PFDavg} = \frac{1}{T}\int_0^T \text{PFD}(t)\,dt \qquad \text{(4-26)}$$
Since PFD increases with time, the average value over a period of time is
typically calculated numerically (Figure 4-11).
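The numerical averaging can be sketched in a few lines. Assuming PFD(t) ≈ λ_D t for a single channel between proof tests (an illustrative model, not the only one), Equation 4-26 becomes:

    lam_d = 1.0e-6   # assumed dangerous failure rate, per hour
    T = 8760.0       # assumed proof test interval, hours
    steps = 10000
    dt = T / steps

    # average PFD(t) over the interval (Equation 4-26)
    pfd_avg = sum(lam_d * (i + 0.5) * dt for i in range(steps)) * dt / T
    print(pfd_avg)        # approximately lam_d * T / 2
    print(lam_d * T / 2)  # 0.00438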
MTTFS/MTTFD
As has been mentioned, MTTF describes the average amount of time until
a device fails. The definition includes all failure modes. In industrial con-
trol systems, the measure of interest is the average operating time between
failures of a particular mode, ignoring all other modes. In the case of an
SIS, the mean time to fail safely (MTTFS) and the mean time to fail danger-
ously (MTTFD) are of interest.
Consider a repairable PLC whose failure times are logged by failure mode. Suppose it fails dangerously, is repaired and placed back in service, and later fails dangerously again. So far the system has failed dangerously twice: the first time after operating 2327 hours, the second after operating 8537 hours (4016 + 4521). The PLC is again repaired and placed in service, and the failure times shown in Table 4-3 are recorded.
EXAMPLE 4-19
Problem: A PLC has measured failure data from Table 4-3. What is the MTTFD?
Solution: The MTTFD equals the average of the operating time intervals that ended in a dangerous failure, ignoring the other failure modes.
Diagnostic Coverage
The ability to detect a failure is an important feature in any control or
safety system. This feature can be used to reduce repair times and to con-
trol operation of several fault tolerant architectures. The measure of this
ability is known as the diagnostic coverage factor, C. The diagnostic cover-
age factor measures the probability that a failure will be detected given
that it occurs. Diagnostic coverage is calculated by adding the failure rates
of detected failures and dividing by the total failure rate, which is the sum
of the individual failure rates. As an example consider a system of ten
components. The failure rates and detection performance are shown in
Table 4-4:
Although only one component failure out of a possible ten is detected, the
coverage factor is 0.991 (or 99.1%). For this example the detected failure
rate is 0.00991. This number is divided by the total failure rate of 0.01. The
coverage factor is not 0.1 as might be assumed by dividing the number of
detected failures by the total number of known possible failures. Note that
the result would have been quite different if Component 1 was NOT
detected, while Component 2 was detected.
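The coverage arithmetic is a weighted average of failure rates, not a count of detected failures. A sketch with assumed component failure rates consistent with the description above:

    # (failure rate, detected?) for ten components; assumed values where the
    # one detected component dominates the total failure rate
    components = [(0.00991, True)] + [(0.00001, False)] * 9

    total_rate = sum(rate for rate, _ in components)
    detected_rate = sum(rate for rate, detected in components if detected)
    coverage = detected_rate / total_rate
    print(total_rate, coverage)  # 0.01 and 0.991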
The coverage factors C1 and C2 can also be defined for specific failure modes, such as safe and dangerous, resulting in terms such as:
CS1 - Coverage for safe failures due to single unit reference diagnostics
Exercises
4.1 Which term requires successful operation for an interval of time:
reliability or availability?
4.2 Which term is more applicable to repairable systems: reliability or
availability?
4.3 Unavailability for a system is given as 0.001. What is the
availability?
4.4 When does the formula MTTF = 1/λ apply?
4.5 Availability of the process control system is quoted as 99.9%. What
is the unavailability?
4.6 A control module has an MTTF of 60 years. Assuming a constant
failure rate, what is the failure rate?
4.7 A control module has an MTTF of 60 years. It has an average repair
time of 8 hours. What is the steady-state availability?
4.8 A control module has an MTTF of 60 years. What is the reliability
of this module for a time period of six months?
4.9 An SIS has a PFS of 0.002 and a PFD of 0.001. What is the
availability?
Answers to Exercises
4.1 Reliability.
4.2 Availability.
4.3 Availability = 0.999.
4.4 The formula MTTF = 1/λ applies to single components with a constant failure rate or series systems with a constant failure rate.
4.5 Unavailability = 0.001.
4.6 The failure rate equals 0.000001903 = 1903 FITs.
4.7 Availability = 0.9999847.
4.8 Reliability = 0.9917.
4.9 Availability = 0.997.
References
1. Billinton, R., and Allan, R. N. Reliability Evaluation of Engineering Systems: Concepts and Techniques. New York: Plenum Press, 1983.
FMEA / FMEDA
Failure Modes and Effects Analysis (FMEA) and Failure Modes Effects and Diagnostic Analysis (FMEDA) are commonly used analysis techniques in the fields of reliability and safety engineering. Both techniques will be discussed in this chapter.

FMEA Procedure
The minimum steps required in the FMEA process (here, at the component level) are simple:
1. List all components.
2. For each component, list all failure modes.
3. For each component/failure mode, list the effect on the next higher level.
4. For each component/failure mode, list the severity of the effect.
At a module or unit level, simply list the functional failure modes of that
level. Often these modes will be identified by another lower level FMEA.
FMEA Limitations
Because each component is reviewed individually, combinations that
cause critical problems are not addressed. In fault tolerant systems, com-
mon cause failures (see Chapter 10) are rarely identified since they require
more than one component failure.
FMEA Format
A FMEA is documented in a tabular format as shown in Table 5-1. Com-
puter spreadsheets are ideal tools for a FMEA. Each column in the table
has a specific definition.
When the FMEA is done at the component level, column one describes the
name of the component under review. Column two is available to list the
part number or code number of the component under review. Column
three describes the function of the component. A good functional descrip-
tion of each component can do an effective job in helping to document sys-
tem operation.
Column four describes the known failure modes of the components. One
row is typically used for each component failure mode. Examples of com-
ponent failure modes include fail shorted, fail open, drift, stuck at one,
stuck at zero, etc., for electronic components. Mechanical switch failure
modes might include stuck open, stuck closed, contact weld, ground
short, etc. (Ref. 2 provides a database listing of failure modes for possible
system components.) Column five describes the cause of the failure mode
of column four. Generally this is used to list the primary stress causing the failure: heat, chemical corrosion, dust, electrical overload, RFI, human operational error, etc. (see Chapter 3).
Column six describes how this component failure mode affects the function of the module (or subsystem) of which the component is a part. Column seven lists the criticality of this component failure mode. In safety evaluations, this column may be used to indicate safe versus dangerous failures.
Column eight is used to list the failure rate of the particular component
failure mode. The use of this column is optional when FMEAs are being
done for qualitative purposes, and required for quantitative FMEA. When
quantitative failure rates are desired and specific data for the application
is not available, failure rates and failure mode percentages are available
from handbooks (Ref. 2).
EXAMPLE 5-1
Solution: We must first list all failure modes for each of the system components. We must then fill out every relevant column for each failure in the table. Table 5-2
shows the results of this system level FMEA. The FMEA has
identified six critical items that should be reviewed to determine the
need for correction. We could consider installing a smart IEC 61508
certified temperature transmitter with automatic diagnostics. We
could install two drain pipes and pipe them in parallel. This would
prevent a single clogged drain from causing a critical failure. A level
sensor on the water tank could warn of insufficient water level. Many
other possible design changes could be made to mitigate the critical
failures or to reduce the number of false trips.
FMEDA
Additional columns are added to the chart as shown in Table 5-3. The 10th
column is an extension to the original MIL standard (Ref. 1) for the pur-
pose of identifying that this component failure is detectable by a diagnos-
tic technique. A number 1 is entered to designate that this failure is
detectable. A number 0 is entered if the failure is not detectable. Column
11 is an extension to the standard used to identify the diagnostic used to
detect the failure. The name of the diagnostic should be listed. Perhaps the
error code generated or the diagnostic function could also be listed.
Column 12 lists the failure mode number, the fraction of the component failure rate that results in a safe failure. Column 13 shows the safe detected failure rate, obtained by multiplying the failure rate (Column 8) by the failure mode number (Column 12) and the detectability (Column 10). The safe undetected failure rate is shown in column 14. This number is calculated by multiplying the failure rate (Column 8) by the failure mode number (Column 12) and one minus the detectability (Column 10).
Column 15 lists the dangerous detected failure rate. The number is
obtained by multiplying the failure rate (Column 8) by one minus the
failure mode number (Column 12) and the detectability (Column 10).
Column 16 shows the calculated failure rate of dangerous undetected
failures. The number is obtained by multiplying the failure rate (Column
8) by one minus the failure mode number (Column 12) and one minus the
detectability (Column 10).
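The column arithmetic generalizes to a small function. This sketch (names are illustrative) splits a component failure rate into the four categories using the failure mode number and the detectability:

    def fmeda_split(rate, safe_fraction, detectability):
        # Columns 13-16: safe detected, safe undetected,
        # dangerous detected, dangerous undetected
        sd = rate * safe_fraction * detectability
        su = rate * safe_fraction * (1 - detectability)
        dd = rate * (1 - safe_fraction) * detectability
        du = rate * (1 - safe_fraction) * (1 - detectability)
        return sd, su, dd, du

    # R1 row of Table 5-4: rate 0.5, failure mode number 1 (safe), detectability 0
    print(fmeda_split(0.5, 1.0, 0.0))  # (0.0, 0.5, 0.0, 0.0)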
[Figure 5-3. AC Input Circuit: single optocoupler input with an isolated 5V supply]
The FMEDA for this circuit is shown in Table 5-4. Component failure rates
must be listed for each component failure mode since diagnostic coverage
calculations are done based on a weighted average of failure rates.
Table 5-4. FMEDA (excerpt)
1 Name | 2 Code | 3 Function | 4 Mode | 5 Cause | 6 Effect | 7 Criticality | 8 Failure Rate | 9 Remarks | 10 Detectability | 11 Diagnostics | 12 Mode | 13 SD | 14 SU | 15 DD | 16 DU
R1-1K | 4555-1 | Input filter | Short | - | No filter | Safe | 0.5 | - | 0 | - | 1 | 0 | 0.5 | 0 | 0
[Figure 5-4. AC Input Circuit Designed for High Diagnostic Coverage: two optocoupler channels (OC1, OC2) read by the microprocessor]
Another AC input circuit is shown in Figure 5-4. This circuit was designed
for high diagnostic coverage. Very high levels of diagnostic coverage are
needed for high safety and high availability systems. The circuit uses two
sets of opto-couplers. The microprocessor that reads the inputs can read
both opto-couplers. Under normal operation both readings should be the
same. In addition, readings must be taken four times per AC voltage cycle.
This allows the microprocessor to read a dynamic signal. When all compo-
nents are operating properly, a logic 1 is a series of pulses. This circuit
design is biased toward fail-safe with a normally energized input sensor.
Table 5-5 shows the FMEDA.
FMEDA Limitations
The FMEDA technique is most useful on mechanical devices to accurately
show the impact of automatic diagnostic devices like partial stroke test
boxes or the effectiveness of a manual proof test procedure. The FMEDA
technique is also effective for determining the diagnostic coverage factors of automatic diagnostics and manual proof test procedures on simple electronic circuits and processes where the failure modes of the components are relatively well known.
There are times when the effect of a particular component failure is not
easily predicted from analysis. A simple test technique called fault injec-
tion testing is used to quickly determine the effect. Particular component
failure modes are actually simulated in an operating unit and the impact is
observed and documented in the FMEDA.
Exercises
5.1 List the steps for a FMEA.
5.2 List the limitations of a FMEA.
5.3 Can the diagnostic ability of a circuit, module, unit, or system be verified by fault injection testing?
5.4 Can a FMEA or a FMEDA benefit from a review?
5.5 Perform a FMEA for a piece of equipment in your plant.
Answers to Exercises
5.1 The steps for a FMEA are:
1. List all components.
2. For each component, list all failure modes.
3. For each component/failure mode, list the effect on the
next higher level.
4. For each component/failure mode, list the severity of
the effect.
5.2 A FMEA does not identify combinations of failures since each
component is reviewed individually. Operational and mainte-
nance failures may be missed. All failure modes of components
must be known or they will be missed.
5.3 Yes, within practical limits the evaluation of the diagnostic ability
of a circuit, module, unit, or system can be verified by fault injec-
tion testing.
5.4 Yes. As with most human engineering activities, a FMEA or a
FMEDA can benefit from a review.
5.5 Answer depends on process and plant.
References
1. U.S. MIL-STD-1629: Failure Mode and Effects Analysis. Springfield, VA: National Technical Information Service.
Fault Trees
A fault tree, when well done, can also be a valuable engineering document
describing how the system is supposed to operate under various fault con-
ditions. This provides necessary documentation for more detailed reliabil-
ity and safety analysis (Ref. 2).
While it is true that fault trees are used primarily during engineering
design to help identify potential design weaknesses, fault trees can also be
of great value when investigating causes of failures or accidents. All of the
trigger events and contributing events can be documented in a graphical
format showing the overall relationship between events and a resultant
failure.
An example fault tree appears in Figure 6-2. We identify the system failure
event: Fire. In a normal atmosphere, we know that two additional things
are required to start a fire: an ignition source and combustible material.
Working down the tree, we identify sources of combustible material and
the basic faults involved, which include a fuel leak and a fuel spill. We also
identify trigger events that may provide an ignition source.
[Figure 6-2. Example fault tree: FIRE at the top of an AND gate; one OR gate collects ignition sources (spark, smoking) and the other collects combustible material (fuel leak, fuel spill)]
At the bottom of the fault tree are basic faults and trigger events.
These are normally considered to be the root cause elements of any failure.
The basic fault is represented by a circle. A trigger event is represented by
the house symbol.
[Figure 6-3. Fault Tree Symbols: AND gate, OR gate, inhibit gate, house (trigger) event]
[Figure: example in which 'Shutdown fails' results when the operator pushes the wrong button while the shutdown alarm sounds]

The priority AND gate is used to show that inputs must be received in a particular sequence.
[Figure: additional fault tree symbols - conditional event, alternative conditional event, transfer in, transfer out, incomplete event that needs attention, and priority AND (event 1 before event 2)]
Consider the fault tree of Figure 6-7. A power system is being
reviewed. The system has three independent sources of power: the com-
mercial utility, a local generator, and a battery system. The AND gate at
the top of the drawing indicates that all three sources must fail in order to lose
power.
There are many ways in which utility power can fail, but the main concern is the protective utility breaker. Therefore the drawing shows the incomplete event symbol, indicating that other utility power failure issues are not developed further.

[Figure 6-7. Power system fault tree: 'Power system failure' is an AND of three branches - utility power failure (utility breaker blown OR other causes), generator failure (generator off line AND operator fails to restart), and battery failure (batteries discharged AND charger fails)]
Care must be taken when calculating probabilities in a fault tree. This is especially true when events into an AND gate are not independent or when PFDavg is to be calculated. It is also important that the probability unions be carefully considered so that multiple instances of a given probability are calculated only once.
AND Gates
With an AND gate, all inputs must be non-zero for the gate to output a
non-zero probability. For example, with a two input AND gate, both
events A and B must have a non-zero probability for the gate output to be
non-zero. Referring to failure events, both failures must occur for the gate output to be failed. If these events are independent (see Appendix C for definitions of independent events and mutually exclusive events), then the probability of getting failure event A and failure event B is given by the formula:
$$P(A \cap B) = P(A) \times P(B)$$
EXAMPLE 6-1
[Figure: AND gate example - 'Power System fails' requires both 'Battery System fails' and 'Commercial Power fails']
OR Gates
With an OR gate, any non-zero input allows the gate output to be non-zero. For two independent failure events A and B, the gate output probability is given by:
$$P(A \cup B) = P(A) + P(B) - P(A) \times P(B)$$
EXAMPLE 6-3
[Figure: OR gate example - 'Pressure signal not available' if either transmitter fails]

EXAMPLE 6-4
[Figure: OR gate example - 'Temperature signal not available' if any of the thermocouples fails]
Approximation Techniques
Often, in order to speed up and simplify the calculation, the faults and
events in a fault tree are assumed to be mutually exclusive and indepen-
dent. Under this assumption, probabilities for the OR gates are added.
--``,,`,,,`,,`,`,,,```,,,``,``,,-`-`,,`,,`,`,,`---
Probabilities for the AND gates are multiplied. This approximation tech-
nique can provide rough answers when probabilities are low, as is fre-
quently the case with failure probabilities. The approach is generally
conservative when working with failure probabilities because the method
gives an answer that is larger than the accurate answer. For failure proba-
bility this may be sufficient.
EXAMPLE 6-6
Problem: The power system fault tree of Figure 6-11 is assigned the following failure probabilities: utility breaker blown = 0.001; utility other = 0.000001; generator off line = 0.2; operator fails to restart = 0.01; batteries discharged = 0.1; charger fails = 0.01. What is the probability of power system failure?

[Figure 6-11. Power system fault tree with failure probabilities]
Solution: Working from the bottom up, the failure probability for battery system failure = 0.1 × 0.01 = 0.001. The failure probability for a generator failure = 0.2 × 0.01 = 0.002. The failure probability for utility power failure ≈ 0.001. The system failure probability = 0.001 × 0.002 × 0.001 = 0.000000002.
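The gate arithmetic of Example 6-6 is easy to script. A minimal sketch in which AND gates multiply independent input probabilities and OR gates take the complement of all inputs failing to occur:

    def and_gate(*probs):
        # all independent input events must occur
        p = 1.0
        for x in probs:
            p *= x
        return p

    def or_gate(*probs):
        # one minus the probability that no input event occurs
        q = 1.0
        for x in probs:
            q *= (1.0 - x)
        return 1.0 - q

    battery = and_gate(0.1, 0.01)        # batteries discharged AND charger fails
    generator = and_gate(0.2, 0.01)      # off line AND operator fails to restart
    utility = or_gate(0.001, 0.000001)   # breaker blown OR other
    print(and_gate(battery, generator, utility))  # approximately 2e-9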
Common Mistake: When building a fault tree and using the Gate Solution
method, it is important to model individual components only once. It is
easy to add something in twice, especially in complex systems.
Consider the case of a pressure switch and a process connection for that
pressure switch (Figure 6-12). If the pressure switch has a PFD of 0.005
and the process connection has a PFD of 0.02, the PFD of the system could
be modeled with a fault tree OR gate as shown in Figure 6-13.
[Figure 6-12. Pressure switch with its process connection]
Assume that the designer felt this probability of fail-danger was too high
and wanted a second (redundant) pressure switch. The system is designed
to trip if either switch indicates a trip (1oo2 architecture). One might
assume the fault tree to be two copies of Figure 6-13 with an additional
AND gate as shown in Figure 6-14.
When solving this fault tree, one must understand whether the process connection boxes represent two independent failure events, each with its own probability, or one failure event. A simple Gate Solution approach that assumes independent events would get the answer 0.0249 × 0.0249 = 0.00062. If both boxes are marked identically, it often means they represent one event. Physically this means that two pressure switches are connected to a common pressure connection. In that case the correct answer is (0.005 × 0.005) + 0.02 - (0.02 × 0.005 × 0.005) = 0.020. Of course it is recommended that the fault tree be drawn more clearly, as is done in Figure 6-15.
[Figure 6-15. Redrawn fault tree: 'Fail-Danger' is the OR of the common pressure connection failure and the AND of the two pressure switch failures]
1oo1 Architecture
For a 1oo1 component or system we can use the simplified approximation formula below to calculate the PFDavg:
$$\text{PFDavg} = \frac{\lambda_D T}{2}$$
1oo2 Architecture
In a 1oo2 architecture, two elements are available to perform the shut-
down function and only one is required. This can be represented by an
AND gate for probability of dangerous failure. Since the PFDavg repre-
sents the unavailability of the system when a demand occurs, then for a
1oo2 architecture both units A and B must fail dangerously for a loss of
safety function.
If the PFDavg is used as the input from each unit, then the PFDavg for the 1oo2 architecture, using incorrect gate probability calculations, would be
$$\text{PFDavg} = \left(\frac{\lambda_D T}{2}\right)^2$$
(This is the incorrect result of multiplying average values, i.e., the averaging is done before the logic multiplication.)
A more accurate equation to calculate the PFDavg for a 1oo2 system is:
$$\text{PFDavg} = \frac{\lambda_D^2 T^2}{3}$$
This is because
$$\frac{1}{T}\int_0^T A(t)\,B(t)\,dt \ne \left[\frac{1}{T}\int_0^T A(t)\,dt\right] \times \left[\frac{1}{T}\int_0^T B(t)\,dt\right]$$
where
$$\text{PFDavg} = \frac{1}{T}\int_0^T \text{PFD}(t)\,dt$$
The correct way to solve a fault tree is to numerically calculate PFD values
as a function of time and average the numerical values (Chapter 13 of Ref.
3). Alternatively, an analytical equation for PFD can be obtained and inte-
grated (this book, Chapter 14). In both cases the solution technique per-
forms averaging after the logic has been applied.
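The order of averaging is easy to demonstrate. Assuming PFD(t) = λ_D t for each unit of a 1oo2 pair, this sketch multiplies first and averages afterward, then compares with the incorrect multiply-the-averages value:

    lam_d = 1.0e-6   # assumed dangerous failure rate, per hour
    T = 8760.0       # assumed test interval, hours
    steps = 100000
    dt = T / steps

    # correct: form PFD(t) for the pair, then average over the interval
    avg_after = sum((lam_d * (i + 0.5) * dt) ** 2 for i in range(steps)) * dt / T

    exact = (lam_d * T) ** 2 / 3      # closed form average of (lam_d * t)^2
    wrong = (lam_d * T / 2) ** 2      # averaging before multiplying
    print(avg_after, exact, wrong)    # first two agree; wrong is 25% low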
Fault trees are also used to document system failure events. They can rep-
resent a rather complex set of circumstances and describe the situation(s)
that led to the failure in a way much clearer than words.
EXAMPLE 6-7
A building communications system failed when switchgear power was lost and the backup alarms went unacknowledged. The emergency batteries did supply power for six hours until they were exhausted. The fault tree below documents the failure events.

[Figure: 'Communications Failure' fault tree. Switchgear power failure branch: utility power failure (utility power switched off for load shedding, shutdown relay mis-set, other causes), generator failure, and other causes. Battery system failure branch: alarms not acknowledged (20th floor alarm unacknowledged, 14th floor alarm unacknowledged, 15th floor alarm unmanned).]
Exercises
6.1 The thermocouples of Example 6-4 have an unavailability of 0.05.
What is the unavailability of the sensor subsystem? (Use the full,
accurate method)
6.2 What is the answer to exercise 6-1 using the approximate method?
What is the error?
6.3 The power system of Figure 6-11 has probabilities of failure as fol-
lows:
Utility breaker blown = 0.1
Utility other = 0.02
Generator off-line = 0.1
Operator fails to restart = 0.02
Batteries discharged = 0.2
Charger failed = 0.02
What is the probability of system failure?
6.4 What are the advantages and limitations of fault tree analysis?
6.5 Draw a fault tree to document a failure in your plant.
Answers to Exercises
6.1 0.142625
6.2 0.15; the error is almost 5%.
6.3 0.000000944.
6.4 Fault Tree Analysis is a top-down approach that is capable of iden-
tifying combinations of failure events that cause system failure. It
is systematic and allows the review team to focus on a specific fail-
ure. It is limited, however, in that it depends on the skill of the
reviewers. In addition, a fault tree can only show one failure (or
failure mode) on a single drawing. This sometimes obscures inter-
action among multiple failure modes.
6.5 Answer depends on specific circumstances.
References
1. Henley, E. J., and Kumamoto, H. Probabilistic Risk Assessment: Reliability Engineering, Design and Analysis. New York: IEEE Press, 1992, pg. 3.
Reliability Block Diagrams
Many modules, units, and systems with one failure mode, such as are
used in industrial control applications, can be modeled through the use of
simple block diagrams. These block diagrams are used to show probabil-
ity of success/failure combinations and may show devices (components,
modules or units) in series, in parallel, or in combination configurations.
[Figure 7-1. Reliability block diagram modeling process: Physical Model (Understand Failure Modes, Understand Operation) → Reliability Block Diagram → Analyze model using Rules of Probability]
The first step in the process of reliability block diagram modeling (Figure 7-1) is to convert from a physical model into a reliability block diagram model. This step is often the hardest and is certainly the most critical.
A reliability block diagram may be viewed as showing the success
paths. For each device that is working successfully, the path goes
through a box representing that device. For each device that has failed, the
path is blocked by the box. For all combinations of successful and failed
devices, if the analyst can find a path horizontally across the reliability
block diagram, those devices are sufficient to allow the system to operate.
Consider the control system drawn in Figure 7-2. Three sensors are
present. All three sensors are wired to each of two controllers. Each con-
troller implements a voting algorithm in order to tolerate the failure of
one sensor. Either controller is capable of operating the valve.
The reliability block diagram model for this system is illustrated in Figure
7-3. This model shows several success paths. One such path through the
block diagram model consists of Sensor A, Sensor B, Controller A, and
Valve. If only these four devices operate, the system operates.
The rules of probability are used to evaluate the reliability block diagram.
Normally, work is done with reliability block diagram device probabili-
ties, and device failures are assumed to be independent. Sometimes it is
easier to use probability of failure, and sometimes it is easier to use proba-
bility of success. The general term probability of success may mean reli-
ability or it may mean availability. If working with non-repairable
systems, reliability (probability of success over the operating time interval
t) is used. In repairable systems, availability (probability of success at time
t) is used.
Series Systems
A series system (Figure 7-4) is defined as any system in which all
devices must work for the system to work. Taking the pessimistic perspec-
tive, a series system fails if any component fails. A prime example of a
series system would be the string of antique Christmas tree lights with
which the author struggles each year. The lights are wired in such a way
that when one bulb fails, all the lights go out! A series system offers no
fault tolerance; there is no redundancy.
[Figure 7-4. Series System: devices A and B in series]
Consider the series system shown in Figure 7-4. The system has two components, A and B, and:
$$R_S = R_A \times R_B \qquad \text{(7-1)}$$
In general, for n devices in series:
$$R_S = \prod_{i=1}^{n} R_i \qquad \text{(7-2)}$$
These equations are a direct result of one of the rules of probability, which states that
$$R_S = 1 - F(A) - F(B) + F(A) \times F(B) \qquad \text{(7-3)}$$
For constant failure rates,
$$R_A = e^{-\lambda_A t} \quad \text{and} \quad R_B = e^{-\lambda_B t}$$
This equals
$$R_S = e^{-(\lambda_A + \lambda_B)t} \qquad \text{(7-4)}$$
Thus, a series system of devices with constant failure rates has an overall constant failure rate equal to the sum of the device failure rates:
$$\lambda_S = \sum_{i=1}^{n} \lambda_i \qquad \text{(7-6)}$$
EXAMPLE 7-1
Solution: Since the system will fail if any one of the devices fails, the reliability block diagram (Figure 7-5) consists of a series system. As per Equation 7-6, the failure rates are added.
EXAMPLE 7-2
Solution: The availability of the series system is the product of the device availabilities:
$$A_S = 0.95 \times 0.9 \times 0.7 \times 0.7 \times 0.7 \times 0.8 = 0.2346$$
EXAMPLE 7-4
Solution: Substituting a five-year (43,800 hour) operating time interval,
$$R(5 \text{ years}) = e^{-\lambda_{TOTAL} \times 43800} = 0.853$$
The MTTF of a series system follows from Equation 4-10:
$$\text{MTTF} = \int_0^{\infty} R(t)\,dt = \int_0^{\infty} [R_1(t) \times R_2(t) \times \cdots \times R_n(t)]\,dt$$
For a series system with constant device failure rates (exponential PDF):
$$\text{MTTF}_{SYSTEM} = \int_0^{\infty} e^{-\sum_{i=1}^{n} \lambda_i t}\,dt = \frac{1}{\sum_{i=1}^{n} \lambda_i} = \frac{1}{\lambda_1 + \lambda_2 + \cdots + \lambda_n} \qquad \text{(7-7)}$$
Therefore, for a series system with constant device failure rates, where
$$\lambda_{TOTAL} = \lambda_1 + \lambda_2 + \cdots + \lambda_n$$
$$\text{MTTF}_{SERIES\ SYSTEM} = \frac{1}{\lambda_{TOTAL}} \qquad \text{(7-8)}$$
Parallel Systems
A parallel system is defined as a system that is successful if any one of
the devices is successful. From the pessimistic perspective, the parallel
system fails only if all of its devices fail. A parallel system offers fault tol-
erance that is accomplished through redundancy.
[Figure 7-7. Parallel System: devices A and B in parallel]
$$R_S = R_A + R_B - R_A \times R_B \qquad \text{(7-9)}$$
Note that the system fails only if both A and B fail. The formula for system failure is:
$$F_S = F_A \times F_B \qquad \text{(7-10)}$$
In general, for n devices in parallel:
$$F_S = \prod_{i=1}^{n} F_i \qquad \text{(7-11)}$$
$$R_S = 1 - F_S = 1 - \prod_{i=1}^{n} F_i \qquad \text{(7-12)}$$
EXAMPLE 7-6
Problem: A vessel jacket is cooled by two redundant cooling systems, each consisting of a water tank, a pump (availability 0.7), and a power source in series. Either cooling system will fail if any of its devices fails, but the jacket will be cooled if either cooling system works. What is the probability that the jacket will not be cooled?

[Figure: two parallel cooling paths, each a series of water tank, pump, and power source]
EXAMPLE 7-7
[Figure: system with redundant sensor-valve paths and a shared controller]
The chances of successful operation for the next month are now 0.8954 (90%). This is a substantial improvement over the results of Example 7-1, in which a one-month reliability of 0.6406 was calculated.
The MTTF of a parallel system is obtained in the same way:
$$\text{MTTF}_{SYSTEM} = \int_0^{\infty} [R_A + R_B - R_A R_B]\,dt$$
for a two device parallel system. If these two devices have constant failure rates (exponential PDF), then:
$$\text{MTTF}_{SYSTEM} = \int_0^{\infty} \left[e^{-\lambda_A t} + e^{-\lambda_B t} - e^{-(\lambda_A + \lambda_B)t}\right] dt = \left[-\frac{1}{\lambda_A}e^{-\lambda_A t} - \frac{1}{\lambda_B}e^{-\lambda_B t} + \frac{1}{\lambda_A + \lambda_B}e^{-(\lambda_A + \lambda_B)t}\right]_0^{\infty}$$
$$\text{MTTF}_{PARALLEL\ SYSTEM} = \frac{1}{\lambda_A} + \frac{1}{\lambda_B} - \frac{1}{\lambda_A + \lambda_B} \qquad \text{(7-13)}$$
Notice that the MTTF does not equal one over λ. A parallel system of devices with constant failure rates no longer has an exponential PDF:
$$\text{MTTF}_{PARALLEL\ SYSTEM} \ne \frac{1}{\lambda_{TOTAL}}$$
The equation MTTF = 1/λ does not apply to any system that has parallel devices. This includes any system that has redundancy. Triple modular redundant systems, dual systems, and partially redundant systems are all in this category.
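Equation 7-13 can be checked numerically. A sketch with two assumed failure rates that integrates the parallel-system reliability of Equation 7-9 and compares with the closed form:

    import math

    la, lb = 1.0e-4, 2.0e-4  # assumed constant failure rates, per hour

    def r_parallel(t):
        # Equation 7-9 with exponential device reliabilities
        ra, rb = math.exp(-la * t), math.exp(-lb * t)
        return ra + rb - ra * rb

    t_end, steps = 2.0e5, 200000
    dt = t_end / steps
    mttf_numeric = sum(r_parallel((i + 0.5) * dt) for i in range(steps)) * dt

    mttf_formula = 1 / la + 1 / lb - 1 / (la + lb)  # Equation 7-13
    print(mttf_numeric, mttf_formula)               # both ~11667 hours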
k-out-of-n Systems
Many systems, especially load sharing systems, are designed with extra devices. In systems where the remaining devices can carry the load, only k of the n installed devices are needed for system success. Consider a tank instrumented with five thermocouples where any four are sufficient, a 4oo5 system. The number of combinations of four devices taken from five is:
$$\binom{5}{4} = \frac{5!}{4!(5-4)!} = 5$$
Notice the notation used. Two numbers stacked within parentheses is the notation for combinations of numbers. (See Appendix C for a detailed explanation of combinations.)
[Figure: five thermocouples installed on a tank]
Given that there are five combinations of four devices, we can build a
series/parallel reliability block diagram. The block diagram in Figure 7-12
has five parallel paths. Each path has four devices in series. An examina-
tion of the block diagram shows that each path is a different combination
of four devices (out of a possible five).
It is easy to calculate the probability of success for one path. Using Equation 7-2, the probabilities of success for the five paths are:
$$R_{PATH1} = R_1 R_2 R_3 R_4$$
$$R_{PATH2} = R_1 R_2 R_3 R_5$$
$$R_{PATH3} = R_1 R_2 R_4 R_5$$
$$R_{PATH4} = R_1 R_3 R_4 R_5$$
$$R_{PATH5} = R_2 R_3 R_4 R_5$$
The probability of success for the system is the union of these five path probabilities. These numbers cannot merely be added; the five paths are NOT mutually exclusive. To obtain the union, the path probabilities must be added; then the combinations of two intersections must be subtracted; the combinations of three intersections must be added; then the combinations of four intersections subtracted; and the combinations of five intersections must be added. This equation would be quite long!
Fortunately, if all devices are the same, the calculation can be drastically simplified. Note that the intersection of path 1 and path 2 is:
$$R_{PATH1} \cap R_{PATH2} = R_1 R_2 R_3 R_4 R_5 = R_I$$
The same result occurs for all intersections, including intersections of two at a time, three at a time, four at a time, and even five at a time. Taking advantage of this fact, a reasonable equation can be written for the union of the five paths:
$$R_{SYSTEM} = \sum_{j=1}^{5} R_{PATHj} - \binom{5}{2}R_I + \binom{5}{3}R_I - \binom{5}{4}R_I + \binom{5}{5}R_I = \sum_{j=1}^{5} R_{PATHj} - 10R_I + 10R_I - 5R_I + 1R_I$$
If all the devices in the model are the same, further simplification is possible. In the thermocouple system example, all the thermocouples are the same. Therefore:
$$R_1 = R_2 = R_3 = R_4 = R_5 = R$$
$$R_{SYSTEM} = 5R^4 - 4R^5 \qquad \text{(7-14)}$$
The general form of the equation can be derived by answering the question, "When is the system successful?" For the 4oo5 system, the system is successful when exactly four devices are good and one is bad, or when all five devices are good.

What is the probability of getting one combination of four good and one bad? This can be written as
$$R \times R \times R \times R \times (1 - R)$$
In general, the probability of one combination of k good devices and n - k bad devices is
$$R^k (1 - R)^{n-k}$$
Since there are $\binom{n}{k}$ such combinations, the probability of exactly k successes out of n devices is
$$\binom{n}{k} R^k (1 - R)^{n-k} \qquad \text{(7-15)}$$
The system succeeds when k or more devices are good, so the terms for k through n are summed:
$$R_{SYSTEM} = \sum_{j=k}^{n} \binom{n}{j} R^j (1 - R)^{n-j} \qquad \text{(7-16)}$$
For the 4oo5 thermocouple problem, n equals five and k equals four. Substituting these values into Equation 7-16:
$$R_{SYSTEM} = 5[R^4(1-R)^1] + 1[R^5(1-R)^0]$$
Simplifying yields
$$R_{SYSTEM} = 5R^4 - 4R^5 \qquad \text{(7-17)}$$
Of course, this is the same as Equation 7-14, showing that both derivation methods yield the same result.
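Equation 7-16 is short enough to implement directly. A sketch of the binomial form, checked against Equation 7-17 for the 4oo5 case:

    from math import comb

    def r_koon(k, n, r):
        # Equation 7-16: sum the exactly-j-successes terms for j = k..n
        return sum(comb(n, j) * r**j * (1 - r)**(n - j) for j in range(k, n + 1))

    r = 0.95
    print(r_koon(4, 5, r))       # 4oo5 system
    print(5 * r**4 - 4 * r**5)   # Equation 7-17, same value (0.97741...)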
[Figure: three control computers feeding voting logic]
EXAMPLE 7-9
Problem: Each of three control computers in a 2oo3 voting configuration has an availability of 0.95. What is the availability of the 2oo3 computer group?
Solution: This system requires that two out of three (2oo3) control computers be successful. The voting logic circuit must also be successful. The reliability block diagram is shown in Figure 7-14. We will first solve the left side of the reliability block diagram. Using Equation 7-16 with n = 3 and k = 2:
$$A_{2oo3} = 3[(0.95)^2(1-0.95)^1] + [(0.95)^3(1-0.95)^0] = 3(0.95)^2 - 2(0.95)^3 = 0.99275$$
The control system shown in Figure 7-15 has three sensors, two controllers, and two valves. For the system to be successful, a sensor must signal a controller-valve set. The two controllers each have two inputs. Each controller is wired to two of the sensors and can operate successfully if one of the two sensors to which it is wired is successful. The sensors have an availability of 0.8. The controllers have an availability of 0.95. The valves have an availability of 0.7.

To start the analysis, each controller and its associated valve can be treated as a series system with an availability of 0.95 × 0.7 = 0.665.
The probability of success for this system can be obtained using the event
space method. With five devices in the block diagram, 32 (25 = 32) combi-
nations of devices can be expected. The combinations are listed in groups
according to the number of failed devices. Group 0 has one combination,
all devices successful. It is listed as item 1 in Table 7-1.
Next, all combinations of one failure are listed. With five devices, five
combinations are expected, with failed devices marked with an asterisk.
These are listed as items two through six in Table 7-1.
In the next step, all combinations of two failures are listed. The equation
for combinations (Equation C-16, Appendix C) is used to determine that
$$\binom{5}{2} = \frac{5!}{2!(5-2)!} = 10$$
combinations of two failures exist. These are listed as items 7 through 16. Similarly,
$$\binom{5}{3} = \frac{5!}{3!(5-3)!} = 10$$
combinations of three failures exist.
After listing all combinations of failed devices, the combinations that will cause system failure are identified. Of course, the combination in Group 0 represents system success, and a path exists across the block diagram for every combination of one failed device. It can be concluded, then, that all Group 1 combinations represent system success. Group 2 must be examined carefully; however, combination 7 has devices A and B failed. Looking again at Figure 7-16, a path still exists across the block diagram using devices C and E; therefore, the system is still successful with combination 7. As the other combinations of two failures are examined, no system failures can be found until combination 16. That combination has devices D and E failed. There is no path across the block diagram when these devices fail. This is illustrated in Figure 7-17.
The process continues through the remaining groups. Table 7-5 compiles the list of all combinations and shows which combinations are successful.
Table 7-5 (continued). Event space combinations; an asterisk marks a failed device.
11 A B* C* D E Success
12 A B* C D* E Success
13 A B* C D E* Success
14 A B C* D* E Success
15 A B C* D E* Success
16 A B C D* E* Failure
Group 3:
17 A* B* C* D E Failure
18 A* B* C D* E Success
19 A* B* C D E* Failure
20 A B* C* D* E Failure
21 A B* C* D E* Success
22 A* B C* D* E Success
23 A B C* D* E* Failure
24 A* B C D* E* Failure
25 A B* C D* E* Failure
26 A* B C* D E* Success
Group 4:
27 A B* C* D* E* Failure
28 A* B C* D* E* Failure
29 A* B* C D* E* Failure
30 A* B* C* D E* Failure
31 A* B* C* D* E Failure
Group 5:
32 A* B* C* D* E* Failure
If the combination probabilities are added, the result should equal 1. This
is a good checking mechanism.
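The event space method is mechanical enough to automate. This sketch enumerates all 32 combinations for the five-device diagram, using the success paths A-D, B-D, B-E, and C-E (the paths behind Equation 7-19) and the availabilities from the example:

    from itertools import product

    avail = {"A": 0.8, "B": 0.8, "C": 0.8, "D": 0.665, "E": 0.665}
    paths = [("A", "D"), ("B", "D"), ("B", "E"), ("C", "E")]  # success paths

    total = 0.0
    for states in product([True, False], repeat=5):
        up = dict(zip("ABCDE", states))
        p = 1.0
        for name, ok in up.items():
            p *= avail[name] if ok else 1.0 - avail[name]
        if any(up[x] and up[y] for x, y in paths):
            total += p  # at least one success path is intact
    print(total)  # system availability, approximately 0.866 with these values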
Adding the four path probabilities gives the success approximation:
$$R_{SYSTEM} = R_{AD} + R_{BD} + R_{BE} + R_{CE} \qquad \text{(7-19)}$$
A more accurate result subtracts the pairwise intersections:
$$R_{SYSTEM} = R_{AD} + R_{BD} + R_{BE} + R_{CE} - (R_{AD} \cap R_{BD}) - (R_{AD} \cap R_{BE}) - (R_{AD} \cap R_{CE}) - (R_{BD} \cap R_{BE}) - (R_{BD} \cap R_{CE}) - (R_{BE} \cap R_{CE}) \qquad \text{(7-20)}$$
Again, not a very useful result. The only thing determined from the suc-
cess approximation approach is that there is a probability of success some-
where between 0.07395 and 1. The method works much better when
device probabilities of success are low. In our example, the numbers are
near 1.
Exercises
7.1 A system consists of a power source, a pump, and a valve. The sys-
tem operates only if all three components operate. The power
source has an availability of 0.8. The pump has an availability of
0.7. The valve has an availability of 0.75. Draw a reliability block
diagram of the system. What is the system availability?
7.2 A second pump and a second valve are added in parallel with the
system of Exercise 7.1. This is shown in Figure 7-18. What is the
system availability?
7.3 A control system requires four sensors, an input module, a controller, an output module, and two valves, all of which must operate for system success. The device availabilities are:
Sensor 0.8
Input Module 0.95
Controller 0.9
Output Module 0.95
Valve 0.75
What is the system availability?
7.4 A controller consultant suggests that your control system (the sys-
tem from Exercise 7.3) could be improved by adding a redundant
controller. If a second controller is put in parallel with the first,
how much does system availability improve?
7.5 A nonrepairable controller module has a constant failure rate of
500 FITS. What is the MTTF for the controller?
7.6 Two nonrepairable controller modules with a constant failure rate
of 500 FITS are used in parallel. What is the system MTTF?
7.7 You wish to approximate the probability of failure for a reliability
block diagram. All block diagram devices have a probability of
success in the range of 0.95 to 0.998. Which model method should
you use?
7.8 You have a reliability block diagram with six devices. Each device
has one failure mode. How many combinations must be listed in
an event space evaluation?
7.9 You have a reliability block diagram with four devices. Each
device has two failure modes. How many combinations must be
listed in an event space evaluation?
Answers to Exercises
7.1 This is a series system. The system availability is 0.8 0.7 0.75 =
0.42
7.2 System availability equals 0.6195.
7.3 System availability equals 0.187, not very impressive for compo-
nents with such high availabilities. Note that the system availabil-
ity is always much lower than component availabilities. (Did you
get an answer of 0.487? Do not forget there are four sensors and
two valves.)
7.4 System availability now equals 0.206, not much of an increase.
7.5 MTTF = 2,000,000 hours
7.6 System MTTF = 3,000,000 hours
7.7 Approximate using probabilities of failure. The approximation methods work best when the probabilities being combined are small; with success probabilities of 0.95 to 0.998, the failure probabilities (0.002 to 0.05) are small.
7.8 With one failure mode, each device has two states, so 2^6 = 64 combinations must be listed.
7.9 With two failure modes, each device has three states, so 3^4 = 81 combinations must be listed.
Repairable Systems
Repairable systems are typical in an industrial environment. It is possible
and highly desirable to install systems in places where they can be
repaired. Such systems offer many advantages in terms of system avail-
ability and safety. Several different fault tolerant system configurations of
repairable modules have been created. Some systems are fully repairable.
All components in the system can be repaired. In some systems, not all
components can be repaired. These are called partially repairable systems.
Markov Models
Markov modeling, a reliability and safety modeling technique that uses state diagrams, can fulfill these modeling needs. The Markov modeling technique uses only two simple symbols (Figure 8-1). It provides a complete set of evaluation tools when compared with many other reliability and safety evaluation techniques (Ref. 1).
[Figure 8-1. Markov model symbols: state (circle) and transition arc (arrow)]
An example Markov model is shown in Figure 8-2. This model shows how
the symbols are used. Circles (states) show combinations of successfully
operating devices and failed devices. Each state is given an identification
number unique to each model. Possible device failures and repairs are
shown with transition arcs, arrows that go from one state to another. Some states are called transient states. These states have both in and out arcs (Figure 8-2, states 0, 1, 2, and 4). Some states are called absorbing states. These have only incoming arcs (Figure 8-2, states 3 and 5).
[Figure 8-2. Example Markov model with states: 0 OK; 1 Degraded Detected; 2 Degraded Undetected; 3 Fail Energized; 4 Comm loss; 5 Fail Safe]
Some states represent system success, while others represent system failure states. It should be noted that multiple failure modes for a device can be shown on one drawing with more than one failure state circle (Figure 8-2, state 3 and state 5).
The Markov model building technique involves the definition of all mutu-
ally exclusive success/failure states in a device. These are represented by
labeled circles. The system can transition from one state to another when-
ever a failure or a repair occurs. Transitions between states are shown
with arrows (transition arcs) and are labeled with the appropriate failure
or repair probabilities (often approximated using failure/repair rates).
This model is used to describe the behavior of the system with time. If
time is modeled in discrete increments (for example, once per hour), simu-
lations can be run using the probabilities shown in the models. Calcula-
tions can be made showing the probability of being in each state for each
time interval. Since some states represent system success, the probabilities
of these states are added to obtain system reliability or system availability
as a function of time. Many other related metrics are also obtained using
various model solution techniques.
[Figure 8-3. Markov Model, Single Nonrepairable Component: state 0 (OK) transitions to state 1 (Fail) with probability λΔt]
A single repairable component with one failure mode has a Markov model
as shown in Figure 8-4. The two states are the same as previously
described for the nonrepairable component. Two transitions are present.
The upper transition represents a failure probability, movement from state 0 to state 1. The lower transition represents a restore probability, movement from failure state 1 to success state 0. The restore rate is represented by the lowercase Greek letter mu (μ). The repair rate times a time increment (Δt) represents the probability of making that movement during the time increment.
[Figure 8-4. Markov Model, Single Repairable Component: state 0 (OK) to state 1 (Fail) with probability λΔt; state 1 back to state 0 with probability μΔt]
Time-Dependent Probabilities
Consider an industrial forging process. A forging machine stamps out
large pieces once every ten minutes (six times an hour). Records show that
1 time out of 100, the machine fails. The average repair time is 20 minutes.
A discrete time Markov model would do an excellent job of modeling this
process. A good selection for the time interval would be 10 minutes. Two
states are required; state 0, defined as success, and state 1, defined as
failure.
The system starts in state 0, success. From state 0, the system will either
stay in state 0 or move to state 1 in each time interval. There is a 1 in 100
(0.01) probability that the system will move from state 0 to state 1. For each
time interval, the system must either move to a new state or stay in the
present state with a probability of one. The probability that the system will
stay in state 0 is therefore 99 out of one hundred (1 - 0.01 = 0.99). Once the
system has failed, it will either stay in state 1 (has not been repaired) or
move to state 0 (has been repaired). The probability of moving from state 1
to state 0 in any time interval is 0.5 (10 minute interval/20 minute repair
time). The system will stay in state 1 with a probability of 1 - 0.5 = 0.5.
The Markov model for this process is shown in Figure 8-5.
Transition Matrix
The model can be represented by showing its probabilities in matrix form
(Ref. 5). An n n matrix is drawn (n equals the number of states) showing
all probabilities. This matrix is known as the Stochastic Transition Proba-
bility Matrix. It is often called the Transition Matrix, nickname P. The
transition matrix for the forging machine is written as follows:
$$P = \begin{bmatrix} 0.99 & 0.01 \\ 0.5 & 0.5 \end{bmatrix} \qquad \text{(8-1)}$$
Each row and each column represents one of the states. In Equation 8-1,
row 0 and column 0 represent state 0, while row 1 and column 1 represent
state 1. If more states existed they would be represented by additional
rows and columns. The numerical entry in a given row and column is the
probability of moving from the state represented by the row to the state
represented by the column. For example, in Equation 8-1, the number in
row 0, column 1 (0.01) represents the probability of moving from state 0 to
state 1 during the next time interval. The number in row 0, column 0 (0.99)
represents the probability of moving from state 0 to state 0 (i.e., of
remaining in state 0) during the next time interval. The other entries have
similar interpretations. A transition matrix contains all the necessary
information about a Markov model. It is used as the starting point for
further calculations.
Steady-State Availability
The behavior of the forging system can be seen by following the tree dia-
gram in Figure 8-6 (Ref. 6). Starting at state 0, the system moves to state 1
or stays in state 0 during the first time interval (T). Behavior during subse-
quent time intervals is shown as the tree diagram branches out. The step
probabilities are marked above each arrow in the diagram.
Certainly, one of the most commonly asked questions about a system like
this is, How much system downtime should we expect? A system is
down when it is in the failure state. With a Markov model, we can
translate this question into, What is the average probability of being in
state 1? The probability of being in state 1 can be calculated using the tree
diagram.
Consider, for example, the path that stays in state 0 for time intervals one, two, three, and four. The probability of following this path is calculated in Equation 8-2:
$$0.99 \times 0.99 \times 0.99 \times 0.99 = 0.9606 \qquad \text{(8-2)}$$
This procedure can be followed to find path probabilities for each time
interval. At time interval one, two paths exist. The upper path has one step
with a probability of 0.99. The lower path has one step with a probability
of 0.01. To find the total probabilities of being in each state after the time
interval, add all paths to a given state. For time interval one, there is only
one path to each state so no addition is necessary. The probability of being
in state 0 equals 0.99 and the probability of being in state 1 equals 0.01.
Notice that the probability of being in either state 0 or state 1 is one! (0.99 +
0.01 = 1). At any time interval, the state probabilities should sum to one.
This is a good checking mechanism.
After the second time interval, the probability of being in state 0 equals the
sum of two paths as calculated in Equation 8-3:

$$(0.99 \times 0.99) + (0.01 \times 0.5) = 0.9851 \qquad \text{(8-3)}$$
The probability of being in state 1 also equals the sum of two paths. The
same method is used repeatedly to obtain the path probabilities for time
intervals three and four as shown in Table 8-2.
Notice that the values are changing less and less with each time interval. A
plot of the two probabilities is shown in Figure 8-7. The values are heading
toward a steady state. This behavior is characteristic of fully repairable
systems. If the tree diagram were fully developed, and those values
reflected in Table 8-2, it would be seen that the steady-state probability of
being in state 1 is 0.01961. This is the answer to the question about down-
time. We should expect the system to be down 1.961% of the time on the
average. For such fully repairable systems with constant failure and repair
rates, downtime is known as steady-state unavailability. Note that in
this simple example, the same number could be calculated from the failure
records if detailed maintenance records are kept.
The tree diagram approach is not practical for Markov models of realistic
size. However, there is another method for calculating steady-state proba-
bilities. Remember that the transition matrix, P, is a matrix showing prob-
abilities for moving from any one state to another state in one time
interval. This matrix can be multiplied by itself to get transition probabili-
ties for multiple time intervals.
$$P^2 = P \times P = \begin{bmatrix} 0.99 & 0.01 \\ 0.5 & 0.5 \end{bmatrix} \begin{bmatrix} 0.99 & 0.01 \\ 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.9851 & 0.0149 \\ 0.745 & 0.255 \end{bmatrix}$$
This process can be continued as long as desired to obtain the n-step prob-
ability transition matrix. For example:

$$P^4 = P^2 \times P^2 = \begin{bmatrix} 0.98152 & 0.01848 \\ 0.92387 & 0.07613 \end{bmatrix}$$

After multiplying these further, notice that the result changes less and less
with each step.
A point is reached where $P^{n+1} = P^n$. The numbers will not change further.
This matrix, labeled $P^L$, is known as the limiting state probability
matrix:

$$P^L = \begin{bmatrix} 0.98039 & 0.01961 \\ 0.98039 & 0.01961 \end{bmatrix}$$

The top and bottom rows of the limiting state probability matrix are
the same numbers. The probability of going to state 0 in n steps is the same
regardless of starting state.
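This convergence is easy to demonstrate numerically; a short sketch along the lines of the discussion above (NumPy's matrix_power simply multiplies P by itself n times):

```python
import numpy as np

P = np.array([[0.99, 0.01],
              [0.50, 0.50]])

# Raise P to successively higher powers and watch the rows converge
for n in (1, 2, 4, 8, 16, 32):
    print(n, np.linalg.matrix_power(P, n).round(5))
# By n = 32 both rows equal [0.98039  0.01961]; the n-step transition
# probabilities no longer depend on the starting state.
```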
State probabilities for each time interval can also be obtained by multiplying a row vector of state probabilities, S, by the transition matrix: $S^{n+1} = S^n P$.
S0 is the starting probability list (time interval 0). For example, if a system
always starts in one particular state, S0 will contain a single one and a
quantity of zeros. The forging machine example always starts in state 0.
The starting probability row S would be

$$S^0 = [1 \quad 0] \qquad \text{(8-4)}$$
As with P, the numbers change less and less each time. Eventually, there is
no significant change:

$$S^{18} = S^{17} = S^L \qquad \text{(8-5)}$$

$$S^L P = [0.98039 \quad 0.01961] \begin{bmatrix} 0.99 & 0.01 \\ 0.5 & 0.5 \end{bmatrix} = [0.98039 \quad 0.01961]$$

Again, the limiting state probability row, labeled $S^L$, has been reached.
The limiting state probability matrix can be created by merely replicating
this row as often as necessary, taking advantage of the fact that all rows in
the limiting state probability matrix are the same.
The limiting state probabilities have been reached when $S^n$ multiplied
by P equals $S^n$. This fact allows an algebraic relationship to solve the prob-
lem directly. Limiting state probabilities exist when

$$S^L P = S^L \qquad \text{(8-6)}$$

Written out for the forging machine, this gives

$$0.99 S_1^L + 0.5 S_2^L = S_1^L \qquad \text{(8-7)}$$

and

$$0.01 S_1^L + 0.5 S_2^L = S_2^L \qquad \text{(8-8)}$$
Rearranging Equation 8-7:

$$0.01 S_1^L = 0.5 S_2^L$$
The problem has two variables and only one equation; it would appear
that no further progress can be made. However, an earlier rule can help.
The probabilities in a row should always add to one. This gives the addi-
tional equation:
$$S_1^L + S_2^L = 1 \qquad \text{(8-9)}$$
Substituting,

$$S_2^L = \frac{0.01}{0.5} S_1^L$$

$$S_1^L + \frac{0.01}{0.5} S_1^L = 1$$

Finally:

$$S_1^L = \frac{1}{1.02} = 0.98039$$

$$S_2^L = 1 - 0.98039 = 0.01961$$
This method works for any fully repairable system in which each state represents either
system success or system failure. All failure probabilities and repair prob-
abilities are assumed to be constant. To calculate steady-state availability,
identify the system success states and sum their probabilities. The sum of
the failure state probabilities provides the steady-state unavailability.
One success state, state 0, is present in the forging process example (Figure
8-4). Thus, steady-state availability for this forging process is 0.98039 or
98.039%. One failure state exists, state 1; therefore, steady-state unavail-
ability is 0.01961 or 1.96%.
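The algebraic method above amounts to solving a small linear system, which generalizes to any number of states. A hedged sketch (NumPy assumed; lstsq is used so the redundant balance equation does no harm):

```python
import numpy as np

def limiting_probabilities(P):
    """Solve S.P = S together with sum(S) = 1 (Equations 8-6 and 8-9)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    S, *_ = np.linalg.lstsq(A, b, rcond=None)
    return S

P = np.array([[0.99, 0.01],
              [0.50, 0.50]])
print(limiting_probabilities(P).round(5))   # [0.98039  0.01961]
```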
EXAMPLE 8-1
[Figure 8-8. Control System Markov Model: success states 0, 1, and 2 (OK); failure state 3 (Fail)]

The limiting state probabilities for this model are:

$S_0^L = 0.958254$
$S_1^L = 0.018975$
$S_2^L = 0.018975$
$S_3^L = 0.003794$
EXAMPLE 8-2
The steady-state availability is the sum of the success state probabilities:

$$A_{ss} = S_0^L + S_1^L + S_2^L = 0.958255 + 0.018975 + 0.018975 = 0.996205$$
Time-Dependent Availability
When a discrete time Markov model begins at t = 0 in a particular starting
state, the availability will vary with time. Figure 8-7 showed this behavior
for the forging system example. If the system is fully repairable with con-
stant repair probabilities, the availability and unavailability will eventu-
ally reach a steady-state value. Until steady-state is reached, the
availability and unavailability values may change in significant ways.
EXAMPLE 8-3
Solution: The state probabilities are computed one step (one hour) at a
time, and the availability (the sum of the success state probabilities) is
plotted for each step.

[Figure 8-9. Availability (t): probability (0.995 to 1) versus time increment (100 to 500)]
It can be seen that availability decreases from a value of 1.0 toward the
steady-state value (0.996205 from Example 8-2) as time passes. The time
increments for this model are one hour. The numbers on the time line
shown in Figure 8-9 and Figure 8-10 are hours.
Absorbing States
Sometimes a system fails in such a way that it cannot be repaired. An obvi-
ous example of this is when a failure causes a destructive major explosion.
In other situations, it is desirable to model the failure behavior of a system
over a period during which repair of some failures is not possible.

[Figure 8-10. Unavailability (t): probability (0 to 0.004) versus time increment (100 to 500)]
A Markov model of such system behavior would show one or more failure
states from which there is no exit. State 1 in Figure 8-3 is an example. Such
states are known as absorbing states. They are typically applied when-
ever a failure occurs for which there is no feasible repair during the time
period of interest.
Time-Dependent Reliability
Reliability has been defined as the probability of system success over a
time interval. This definition of reliability does not allow for system
repair. It fits perfectly with systems that cannot be fully repaired at
the system level.
Repairs can be made, however, below the system level. A system that has
a discrete time Markov model with more than one success state may move
between those states without altering system reliability. Component or
module failures and subsequent repairs that cause movement only
between system success states do not cause system failure. When a com-
ponent or module failure causes the system to move from a success state
to an absorbing failure state, system failure occurs.
Reliability for such systems can be calculated for any time interval using
methods similar to those used to calculate availability as a function of
time.

[Figure 8-11. Partially Repairable Control System Markov Model: success states 0, 1, and 2 (OK); absorbing failure state 3 (Fail)]
EXAMPLE 8-4
Solution: Using a discrete time Markov model with a one hour time
increment, reliability is calculated by multiplying each successive S
by P, the transition matrix built from the model of Figure 8-11.
Adding the probabilities from states 0, 1, and 2, the reliability for the
first hour equals 1.0 (to the precision displayed).
Continuing the process, the reliability values for 750 hours are
calculated. These are plotted in Figure 8-12.
[Figure 8-12. Reliability (t), Partially Repairable Control System: probability falls from 1.0 toward roughly 0.95 over 750 one-hour time increments]
Hand calculation of these matrix operations quickly becomes impractical for com-
plex systems. A shortcut is available, however. Many spreadsheet pro-
grams in common use have the ability to numerically invert a matrix. This
tool can be used to make quick work of previously time-consuming MTTF
calculations.
The N matrix, obtained by inverting (I - Q), where Q is the portion of the
transition matrix covering the transient (success) states, provides the
expected number of time increments that the
system dwells in each system success state (a transient state) as a function
of starting state. In our example, the top row states the number of time
increments per transient state if we start in state 0. The middle row gives
the number of time increments if we start in state 1. The bottom row states
the number of time increments if we start in state 2. If a system always
starts in state 0, we can add the numbers from the top row to get the total
number of time increments in all system success states. When this is multi-
plied by the time increment, we obtain the MTTF when the system starts
in state 0. In our example, this number equals 26,250 hours since we used a
time increment of one hour. If we started this system in state 1, we would
expect the system to fail after 26,000 hours on the average. If we should
start the system in state 2, we would also expect 26,000 time increments to
pass until absorption (26,000 hours until failure).
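The same shortcut is just as quick in a few lines of Python. The sketch below assumes NumPy and uses illustrative failure and repair rates (not the values behind the 26,250-hour figure above) purely to show the mechanics of N = (I - Q)^-1:

```python
import numpy as np

dt = 1.0                       # time increment, hours
lam, mu = 1.0e-4, 0.25         # illustrative failure/repair rates

# Q: transient (success-state) portion of the transition matrix
Q = np.array([
    [1 - 2*lam*dt, lam*dt,            lam*dt     ],
    [mu*dt,        1 - (mu + lam)*dt, 0.0        ],
    [0.0,          0.0,               1 - lam*dt ],
])

N = np.linalg.inv(np.eye(3) - Q)  # expected dwell increments per state
print("MTTF from state 0:", N[0].sum() * dt, "hours")  # sum the top row
print("MTTF from state 1:", N[1].sum() * dt, "hours")
```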
This situation can also be accurately modeled with a discrete time Markov
model using numerical solution techniques. Consider the case of a safety
instrumented system with periodic proof testing, which typically includes
an inspection and test, or series of tests. While the time increments are
counting and have not reached the test time, the probability of a restore
operation from a fail-danger state is zero because the failure is not known.
When the time increment reaches the periodic test time, a proof test is per-
formed and repair is made if a failure is detected by the test.
The Figure 8-13 model shows a simplified 1oo2 system without common cause
or diagnostics. (For a complete model of a 1oo2 system, see Chapter 14.)
Failure rates with superscript S are fail-safe and cause an output to de-
energize. Failure rates with superscript D are fail-danger and cause an
output to energize. Restore rates for failures detected by automatic diag-
nostics have the subscript O, indicating an on-line restore. Restore rates
with subscript P indicate a restoration only when periodic proof test
detects the failure. The periodic restore arc from state 3 to state 0 is not
constant. At test time, the system either works correctly or it is repaired in
a finite period of time. At the end of the test time, the system is restored to
state 0. The probability of moving from state 3 to state 0 is a time-depen-
dent function as shown in Figure 8-14.
[Figure 8-14. Probability of the state 3 to state 0 restore transition versus operating time: zero during the operating time interval, one at the proof test time]

This model can be solved using discrete time matrix multiplication for the
case where a perfect periodic proof test is done to detect failures in state 3.
The P matrix is normally:

$$P = \begin{bmatrix}
1-(2\lambda^D+2\lambda^S)\Delta t & 2\lambda^D \Delta t & 2\lambda^S \Delta t & 0 \\
\mu_O \Delta t & 1-(\mu_O+\lambda^S+\lambda^D)\Delta t & \lambda^S \Delta t & \lambda^D \Delta t \\
\mu_S \Delta t & 0 & 1-\mu_S \Delta t & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$
When the time increment counter equals the end of the proof test and
repair period, the matrix is changed to represent the known probabilities of
failure. The P matrix used then is:

$$P_{test} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}$$

This matrix represents the probability of detecting and repairing the fail-
ure. The 1 in each row indicates the assumption of perfect proof testing and repair.
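The time-dependent solution described here is a loop: multiply S by the normal P each increment, and substitute the proof-test matrix whenever the counter reaches the test time. A sketch under assumed (illustrative) rates and a one-year test interval:

```python
import numpy as np

dt = 1.0                                   # one-hour increments
lD, lS, muO, muS = 1e-6, 4e-6, 0.1, 0.25   # illustrative rates

P = np.array([
    [1-(2*lD+2*lS)*dt, 2*lD*dt,          2*lS*dt,  0    ],
    [muO*dt,           1-(muO+lS+lD)*dt, lS*dt,    lD*dt],
    [muS*dt,           0,                1-muS*dt, 0    ],
    [0,                0,                0,        1    ],
])
P_test = np.zeros_like(P)
P_test[:, 0] = 1.0          # perfect proof test: every state restored

S = np.array([1.0, 0.0, 0.0, 0.0])
test_interval = 8760        # proof test once per year (assumption)
worst = 0.0
for t in range(1, 2 * test_interval + 1):
    S = S @ (P_test if t % test_interval == 0 else P)
    worst = max(worst, S[3])   # track the fail-danger probability
print("peak PFD over two test periods:", worst)
```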
Modeling Imperfect Proof Test and Repair
In the previous analysis, an assumption was made that is not realistic. The
assumption of perfect proof test and repair is made in many safety
integrity (i.e., low demand SIS verification) models because of the limita-
tions in the modeling tool. When the models are solved using discrete time
matrix multiplication, this limitation is easily overcome. To
include the effect of imperfect proof test and repair, the failure rate going
to the fail-energize (absorbing) state can be split into failures detected dur-
ing a periodic proof test and those that are not. The split is based on proof
test effectiveness, which can be determined with a FMEDA (Chapter 5).
The upgraded Markov model is shown in Figure 8-16.
[Figure 8-16. Markov model with imperfect proof test: the dangerous failure rate from the degraded state is split between state 3 (detected at the proof test) and absorbing state 4 (Fail-Energize)]

$$P = \begin{bmatrix}
1-(2\lambda^D+2\lambda^S)\Delta t & 2\lambda^D \Delta t & 2\lambda^S \Delta t & 0 & 0 \\
\mu_O \Delta t & 1-(\mu_O+\lambda^S+\lambda^D)\Delta t & \lambda^S \Delta t & E\lambda^D \Delta t & (1-E)\lambda^D \Delta t \\
\mu_S \Delta t & 0 & 1-\mu_S \Delta t & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

where E is the proof test effectiveness.
When the time counter equals the end of the proof test and repair period,
the matrix is changed to represent the known probabilities of failure. The
P matrix then used is:

$$P_{test} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}$$
This matrix indicates that all failures are detected and repaired except
those from state 4, where they remain failed. A plot of the PFD (state 3 or
state 4) as a function of operating time interval is shown in Figure 8-17.
The dotted line at the bottom of the figure shows the impact of the failures
that are not detected during the proof test. Those undetected failures
cause increasing PFD as a function of operating time interval.
Modeling Notation
It is common practice not to display the Δt in Markov model drawings or
in matrix descriptions. When this is done, the drawing in Figure 8-16
would look like Figure 8-18. It is understood that a time increment is being
used in the solution of such drawings and matrix descriptions.
[Figure 8-18. Shortcut Notation for the Markov Model of Figure 8-16]
Exercises
8.1 A system contains three modules. Each module can be either oper-
ating successfully or failed. How many possible states may exist in
the Markov model for this system?
8.2 A system has the Markov model shown in Figure 8-19. The system
is fully repairable. The arcs are labeled with probabilities in units
of failure probability per hour. The system is successful in states 0
and 1. Calculate the limiting state probability row. What is the
steady-state availability?
[Figure 8-19. Markov Model for Exercise 8.2: success states 0 and 1 (OK), failure state 2]
Answers to Exercises
8.1 $2^3 = 8$
8.2 The P matrix is

$$P = \begin{bmatrix}
0.9898 & 0.01 & 0.0002 \\
0.05 & 0.945 & 0.005 \\
0 & 0.05 & 0.95
\end{bmatrix}$$

(rows and columns ordered by state 0, 1, 2; each row sums to one)
References
The system MTTF (Mean Time to Failure) can be calculated using matrix
algebra. Since MTTF is a measure that does not consider system failure
and subsequent repair, the model is modified by eliminating the repair arc
from state 2 to state 1. The P matrix for the modified model is:
$$P = \begin{bmatrix}
1-2\lambda & 2\lambda & 0 \\
\mu_o & 1-(\lambda+\mu_o) & \lambda \\
0 & 0 & 1
\end{bmatrix}$$
Using this new P matrix, solve for MTTF by following the steps detailed in
Chapter 8 (see Appendix B for Matrix math). Assume the system starts in
state 0. After inverting the (I - Q) matrix and adding the top row (reference
Appendix B for details of the derivation):
$$\text{MTTF} = \frac{3\lambda + \mu_o}{2\lambda^2} \qquad \text{(9-1)}$$
Failures that are detected by on-line diagnostics must
be distinguished from those that are not. This distinction must be made
because diagnostic coverage affects repair time.
$$\mu_o = \frac{1}{T_R} \qquad \text{(9-2)}$$
The variable TR refers to average restore time. The on-line restore rate
applies to all failures that are covered (detected by on-line diagnostics).
In Figure 9-2, a new Markov model is presented that accounts for the dif-
ference between detected failures and undetected failures in the degraded
(not fully operational) state. From state 0, a detected failure will take the
system to state 1. Repairs are made from state 1 to state 0 at the on-line
repair rate. An undetected failure will take the system to state 2. When the
system is in state 1 or in state 2, one controller is operating. From these
states another failure will cause system failure.
$$\text{MTTF}_{COV} = \frac{3\lambda + 3\mu_o - 2C\mu_o}{2\lambda^2 + 2\lambda\mu_o - 2C\lambda\mu_o} \qquad \text{(9-3)}$$
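A quick numeric check of Equations 9-1 and 9-3 (the values chosen here match Exercises 9.1 through 9.3 at the end of this chapter):

```python
def mttf_ideal(lam, mu_o):
    # Equation 9-1: ideal dual architecture
    return (3 * lam + mu_o) / (2 * lam**2)

def mttf_cov(lam, mu_o, c):
    # Equation 9-3: dual architecture with diagnostic coverage c
    return ((3 * lam + 3 * mu_o - 2 * c * mu_o) /
            (2 * lam**2 + 2 * lam * mu_o - 2 * c * lam * mu_o))

lam = 0.0001          # failures per hour
mu_o = 1.0 / 4.0      # Equation 9-2 with TR = 4 hours
print(mttf_ideal(lam, mu_o))        # 12,515,000 hours
print(mttf_cov(lam, mu_o, 0.7))     # roughly 26,651 hours
```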
Imperfect Switching
Another assumption made in the ideal model of Figure 9-1 was that no switch-
ing mechanism is present. Many practical implementations of redundancy
include a switching mechanism to select the appropriate module output.
The new system is drawn in Figure 9-4. It has an output selector switch
that chooses which module output to route to the system output.
If the failure is not detected by the diagnostics, one of two things will
happen.
A plot of MTTF versus coverage for this system is shown in Figure 9-6. The effect
of diagnostic coverage is again critical because one half of the
undetected failures cause immediate failure.
Diagnostics can detect these dangerous failures and allow the system to be
quickly repaired. There is a significant improvement in RRF when such a
system has good diagnostics (and plant maintenance policies that ensure
reasonably quick repair).
The RRF calculations assume a repair time of 24 hours. The plot shows
RRF is highest when the dangerous diagnostic coverage is 100%.
Because automatic diagnostics matter so much in safe and
reliable control systems, the ability to measure and evaluate those diag-
nostics is important. This is done using a FMEDA (Chapter 5) and verified
with testing that simulates failures and records diagnostic performance.
EXAMPLE 9-1
The superscript S is used for the safe coverage factor, $C^S$. The superscript D is used for the dangerous coverage
factor, $C^D$.
EXAMPLE 9-2
Solution: The total safe failure rate in Table 9-1 is 70.04 FITS. The
safe detected failure rate is 65.02 FITS; therefore, the safe coverage
factor = 65.02/70.04 = 0.928. The total dangerous failure rate is 23.5 FITS. The
dangerous detected failure rate is 15 FITS. The dangerous coverage
factor is 15/23.5 = 0.64. This circuit is based on a conventional PLC input circuit
with added diagnostics. Many would judge such conventional PLC
circuits to be insufficient for safety applications.
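The arithmetic of Example 9-2, written out as a tiny helper (names are illustrative):

```python
def coverage_factor(detected_fits, total_fits):
    # Coverage = detected failure rate / total failure rate
    return detected_fits / total_fits

c_safe = coverage_factor(65.02, 70.04)      # C^S for Table 9-1
c_dangerous = coverage_factor(15.0, 23.5)   # C^D for Table 9-1
print(round(c_safe, 3), round(c_dangerous, 2))   # 0.928 0.64
```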
Measurement Limitations
Although the FMEDA technique can provide good diagnostic coverage
factor numbers when done accurately, the main limitation is that the
FMEDA creator(s) and its reviewers must know about all possible failure
modes of the components used in the circuit/module/unit. Should unan-
ticipated (unknown) failure modes occur, they may or may not be
detected by the on-line diagnostics as designed. Fortunately, the failure
rates of those failure modes are likely to be very small, or someone would
already know about them. This is especially true for components that are used in
many applications. The possibility of unknown/undetected failure modes
is higher for new components. When unknown failure
modes are considered likely, this can be indicated on the FMEDA as shown
in Table 9-2. This is the FMEDA of Table 5-4 with an unknown failure
mode added. Notice that the dangerous diagnostic coverage dropped from
99.96% to 99.73%.
Diagnostic Techniques
As control computer capability increases, we expect better diagnostic cov-
erage. The computer HAL from the movie 2001 said, "I've just picked
up a fault in the AE35 unit. It's going to 100% failure within 72 hours."
While our newest machines are not yet quite at this level, new automatic
diagnostic techniques are constantly being developed and used to
improve diagnostic coverage.
High coverage factors (C > 95%) are hard to achieve. Controllers must be
designed from the ground up with self-diagnostics as a goal. Electronic
hardware must be carefully designed to monitor the correct operation of
each circuit. Software must properly interact with the hardware. In the
past, control computers have been estimated to provide 93% to 95% cover-
age [Ref. 5 and 6]. More recent designs have achieved diagnostic coverage
factors greater than 99% [Ref. 7 and 8]. To achieve these levels of coverage,
a number of diagnostic techniques have been developed. They can be clas-
sified (Chapter 4) in two ways: a comparison to a predetermined reference
and a comparison to a known good operating unit.
Reference Diagnostics
A comparison to a predetermined reference is the most commonly used
diagnostic technique. The auto mechanic measures oil pressure, wet and
dry compression pressure, or mechanical clearances. Then these results
are compared to predetermined reference values to judge whether a
failure is present.
Analog-to-digital converters are useful for more than process input mea-
surement. Certain voltages and currents within a module indicate failure.
Circuits can monitor the voltage of any power source. If a component fail-
ure causes the supply voltage to exceed bounds, the failure is detected.
Power consumption can also be measured, and many failures are indi-
cated by an increase in current.
Many digital circuits use known sequences of bit patterns. If the sequence
of binary numbers is added, a sum results. The same number should be
calculated every time the sequence repeats. If the number is different, a
failure has occurred.
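A hedged sketch of such a sequence-sum diagnostic; the sample values and the mask width are assumptions for illustration:

```python
def sequence_sum(values, mask=0xFFFF):
    # Add the known sequence of binary numbers, wrapping like a
    # fixed-width register; a healthy circuit gives the same sum
    # every time the sequence repeats.
    return sum(values) & mask

known_good = sequence_sum([0x12, 0x34, 0x56, 0x78])   # reference pass
latest = sequence_sum([0x12, 0x34, 0x57, 0x78])       # one bit flipped
print(latest == known_good)   # False: a failure has been detected
```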
Output current sensors can be used to detect open and short circuits in
output devices. If the current rises above an upper bound, a short failure is
detected. Outputs can be pulsed briefly to verify that the output is able to
switch. The normal output state is restored automatically as soon as the
switch is verified as good. The I/O power can be monitored, which allows
detection of failed I/O power or possibly a failed cutoff switch.
If a test signal injected into an input circuit does not produce the expected
reading, the input circuit has failed. This diagnostic method will detect stuck-at-
one and stuck-at-zero input circuit failures.
Analog input signals have better diagnostic potential than discrete signals.
In normal situations, an analog signal varies. One good diagnostic tech-
nique is the use of a stuck signal detector. If the analog signal gives the
exact same reading for several scans in a row, it has probably failed.
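A hedged sketch of a stuck signal detector; the scan count threshold is an assumption:

```python
from collections import deque

class StuckSignalDetector:
    """Flag an analog input that gives the exact same reading for
    several scans in a row (the threshold of 10 scans is assumed)."""

    def __init__(self, scans=10):
        self.readings = deque(maxlen=scans)

    def update(self, value):
        self.readings.append(value)
        window_full = len(self.readings) == self.readings.maxlen
        return window_full and len(set(self.readings)) == 1

detector = StuckSignalDetector()
for value in [4.02] * 10:          # a frozen analog signal
    stuck = detector.update(value)
print(stuck)                        # True: probable input failure
```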
Comparison Diagnostics
A comparison to another operating unit is also useful. Results of dynamic
calculations can be compared. In a dual configuration, any disagreement
may identify a fault. In a triple configuration, a voting circuit is used to
identify when one unit disagrees with the other two. It is likely the dis-
agreeing unit has failed. This can be a useful diagnostic technique for
many failures.
Coverage effectiveness depends on the amount of data com-
pared. In general, the more data compared, the higher the effectiveness.
A complete evaluation of diagnostic coverage for a safety function
must include the valves, sensors, field transmitters, limit switches, sole-
noids, and other devices, along with the associated wiring, junction boxes,
and connections.
Field devices with microprocessors built into them are known as "smart"
devices. The ability to put a microprocessor into a field device allows
diagnostic capabilities never before possible. Techniques formerly used in
a controller module are now practical within a field device.
Figure 9-13. Differential Pressure Measurements with Plugged Impulse Lines (Ref. 9)
A safety shutoff valve may sit motionless for years. Many component failures occur that can cause the
valve to stick. These failures include the cold welding of O-rings and seals
and corrosion between moving parts. A controller or a smart device in the
field can automatically test for this condition and indicate the failure. A
partial stroke test can be set up to move the valve a small amount. This can
detect a significant percentage of dangerous failures in the final element
assembly (Ref. 10).
Diagnosing failures in some field devices, even "dumb" field devices, can
be done with intelligent input/output circuits in a control system module.
Output current sensors can detect field device failures. If the average out-
put current exceeds an upper bound for too long, this indicates a short cir-
cuit failure of the load device or the field wiring. If an output channel is on
and a minimum current is not being drawn, this indicates open circuit fail-
ure of the field device or the associated wiring.
Comparison diagnostic techniques are popular for use with field sensors,
analog and discrete. For discrete sensors, Boolean comparisons can detect
differences in sensors, although care must be taken that scanning differ-
ences and noise do not cause false diagnostics. The logic of Figure 9-15 has
a timer to filter out temporary differences between two discrete inputs
labeled A and B. A Time OFF (TOFF) timer is used for the normally ener-
gized signals. In Figure 9-16, 2oo3 voting logic compares three discrete
inputs labeled A, B and C. The majority signal drives a coil. Additional
logic could be added to specifically compare each of the three combina-
tions of two signals. When one signal appears in two comparison mis-
matches, it is likely to be the bad signal.
[Figure 9-15. Discrete comparison diagnostic: a TOFF timer filters temporary differences before driving a diagnostic alarm]
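The 2oo3 vote and the pairwise comparison described above, sketched in Python (channel names are illustrative):

```python
from itertools import combinations

def vote_2oo3(a, b, c):
    # Majority of three discrete inputs drives the output coil
    return (a and b) or (b and c) or (a and c)

def likely_bad_channel(inputs):
    # A channel that appears in two pairwise mismatches is probably
    # the failed one.
    mismatched = [set(p) for p in combinations(inputs, 2)
                  if inputs[p[0]] != inputs[p[1]]]
    for name in inputs:
        if sum(name in m for m in mismatched) == 2:
            return name
    return None

inputs = {"A": True, "B": True, "C": False}
print(vote_2oo3(*inputs.values()), likely_bad_channel(inputs))  # True C
```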
An equivalent technique can be used for analog signals. Analog signals are sent to
a median selector. The median selector's output is used as the process sig-
nal. In addition, comparisons must be made between the analog signals, look-
ing for differences greater than a certain magnitude that last longer than a set time.
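A sketch of the analog version: median selection plus a deviation comparison that must persist before alarming (the limit and dwell values are assumptions):

```python
import statistics

def update_deviation_timers(signals, timers, limit=0.5, dt=0.1, dwell=5.0):
    """Use the median of the redundant signals as the process signal;
    accumulate time for any channel deviating from the median by more
    than `limit`, and alarm only after `dwell` seconds."""
    med = statistics.median(signals)
    alarms = []
    for i, s in enumerate(signals):
        timers[i] = timers[i] + dt if abs(s - med) > limit else 0.0
        if timers[i] >= dwell:
            alarms.append(i)
    return med, alarms

timers = [0.0, 0.0, 0.0]
med, alarms = update_deviation_timers([50.1, 50.2, 58.0], timers)
print(med, alarms)   # 50.2 [] -- channel 2 alarms only if it persists
```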
Exercises
9.1 Records indicate that actual repair time for the equipment in the
plant averages four hours. What is the repair rate for immediately
detected failures?
9.2 A dual controller system has the ideal Markov model shown in
Figure 9-1. Using the repair rate from exercise 9.1 and a controller
module failure rate of 0.0001 failures/hour, what is the system
MTTF?
9.3 A dual controller has a watchdog timer diagnostic with a coverage
of 0.7. Considering the more detailed Markov model of Figure 9-2
that considers diagnostic coverage, what is the MTTF using the
failure and repair rate numbers from exercise 9-2?
Answers to Exercises
9.1 The repair rate for immediately detected failures is 0.25 repairs per
hour.
9.2 Using equation 9-1, the MTTF = 12,515,000 hours or 1428 years!
(8760 hours per year)
9.3 Using equation 9-4, the MTTF now equals 26,651 hours or 3 years.
This is much less than the answer from 9.2 showing the impact of
realistic modeling that includes diagnostic coverage.
9.4 Using 9-4, the MTTF is 109,246 hours or 12.47 years.
9.5 Thermocouple burnout results in an open circuit of the thermocou-
ple. One technique for detecting this failure would involve con-
necting a small current source so that the voltage across the
thermocouple would go to a few volts (instead of the normal milli-
volts) when an open occurs.
9.6 A shorted 4-20 mA transmitter in a 2-wire circuit could be detected
if the current goes above the 20-mA rating by a significant amount
(a 21-mA threshold is recommended).
9.7 A frozen D/A converter might be detected by checking to deter-
mine if the digital output remains the same for several readings.
This assumes that normal process noise creates variations for each
reading when the D/A is working.
9.8 For most switches there is no electrical difference between a closed
switch and a short circuit. Therefore detection of such a failure is
extremely hard. Most control system designers are replacing such
switches with analog transmitters, which offer better diagnostic
capability.
References
1. Smith, S. E. "Fault Coverage in Plant Protection Systems." ISA
Transactions, Vol. 30, No. 1. Research Triangle Park, NC: ISA, 1991.
10. van Beurden, I. and Amkreutz, R. "The Effects of Partial Valve Stroke
Testing on SIL Level." Sellersville, PA: exida, 2001.
Common-Cause Failures
A common-cause failure is defined as the failure of more than one device
due to the same stress (cause). A common-cause failure negates the bene-
fits of a fault tolerant system [Ref. 1, 2, and 3]. For example, many fault tol-
erant systems provide two modules to prevent system failure when a
module failure occurs. If both modules fail due to a single stress, such a
fault tolerant system will fail; a similar problem occurs when a fault toler-
ant system provides three or more redundant devices. This is not a theo-
retical problem. Actual study of some field installations has shown that
the reliability metrics PFD, MTTF, and so on, are much worse than reli-
ability models predicted. The autopsy reports of failures in these situa-
tions indicate that, in some cases, more than one redundant device has
failed due to a common-cause stress.
In one case history, the door of a control rack cabinet was opened to check
on a status display. Just before the maintenance technician was finished, a
call came on the walkie-talkie: "It's time for lunch." The simple response
was, "I'll be there soon." In the cabinet were three controller modules
mounted in the same card rack in a fault tolerant system. When the
technician keyed the walkie-talkie to answer, the radio transmission so
close to the open cabinet upset all three controller modules at once, and
the redundant system failed from a single stress.
Several loose wire lugs in the bottom of a cabinet created excess resistance
in I/O conductors that normally carry several amps. These high resistance
connections were generating heat in the cabinet. Two microprocessor
cards in the cabinet were configured redundantly. As the temperature
went up, the precise timing needs of the digital signals were no longer met
and both microprocessors failed in a short period of time. The system
failed and shut down a chemical process. For this failure:
An engineer was adding new logic to a dual redundant PLC. When the
download command was sent, a "memory full" error message was
received just before both units crashed and shut down. For this failure:
A new engineer noticed that the pressure readings in the boiler burner
management system (BMS) were a little off. He recalibrated all three trans-
mitters using the wrong procedure and set all three to the wrong range. If
the pressure ever went into shutdown range, none of the three transmit-
ters would have sent the correct signal. For this failure:
Two valves were piped in series to ensure fuel flow could be shut off
when one or both were closed. If one valve should stick open, the second
valve would do the job. To avoid power dissipation during the long peri-
ods where the valves would not be used, the system was designed to be
energized to trip. A fire started near the process unit. This was sensed by
the safety instrumented system and both valve outputs were energized.
Unfortunately, the cables for the valves were routed through the same
tray and that tray was over the fire. Both cables burned, and the valves did
not close. For this failure:
All of the things that cause failure can be the source of a common-cause
failure. As we have seen, they may be internal and include design errors
and manufacturing errors. They may be external and include environmen-
tal stress, maintenance faults, and operational errors.
Electrical stress includes voltage spikes, lightning, and high current levels.
Mechanical stress includes shock and vibration. Chemical stress includes
corrosive atmospheres and salt air. Physical stress includes temperature
and humidity.
Design errors, most often software design errors, are a major source of
common-cause failure. Consider the design process. The complexity of
modern control systems increases the chances of design errors. In addi-
tion, during product development, testing is extensive, but system com-
plexity may prevent complete testing. In many cases, the system is not
tested in such a way that a design fault is apparent. Then, one day, the sys-
tem receives inputs that require proper operation of something that was
designed incorrectly. System elements do not operate as needed. The sys-
tem does not operate properly. By definition, this is a system failure. If
redundant elements are identical, they will suffer from the same design
errors and may fail in exactly the same way, given the same inputs.
Common-Cause Modeling
There are several models available to predict the effects of common-cause
susceptibility. One of the models should be used in the safety and reliabil-
ity analysis of redundant systems. Without proper consideration of com-
mon-cause effects, safety and reliability models can produce extremely
optimistic results.
The area within the rectangle of Figure 10-3 represents the total rate of
stress events (stress rate) where stress is high enough to cause a failure.
When only one component is subjected to the stress, the stress rate equals
the failure rate. Thus, the area within the rectangle represents the rate at
which one or more components fail due to stress, that is, the failure rate. Over a
portion of the area, stress is high enough that two or more units fail due to
stress. That portion is designated with the Greek lower case letter beta, β.
The beta factor is used to divide the failure rate into the common-cause
portion, $\lambda^C$, and the normal (independent) portion, $\lambda^N$. The following
equations are used:
$$\lambda^C = \beta\lambda \qquad \text{(10-1)}$$

and

$$\lambda^N = (1-\beta)\lambda \qquad \text{(10-2)}$$
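Equations 10-1 and 10-2 in executable form (a trivial sketch):

```python
def beta_split(lam, beta):
    # Equations 10-1 and 10-2: divide a device failure rate into its
    # common-cause and normal (independent) portions.
    lam_c = beta * lam
    lam_n = (1 - beta) * lam
    return lam_c, lam_n

print(beta_split(25_000, 0.1))   # (2500.0, 22500.0) FITS, as in Example 10-1
```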
EXAMPLE 10-1
Problem: A dual power supply system is modeled with three states. In
state 0, both power supplies are successfully operating. In state 1, one power supply has failed but the
system is successful. In state 2, both power supplies have failed
and the system has failed. If common cause were not considered, $\lambda_1$
would be 50,000 FITS (25,000 FITS for each operating power
supply), $\lambda_2$ would be 25,000 FITS, and $\lambda_3$ would be zero. What would
the failure rates be for $\lambda_1$, $\lambda_2$, and $\lambda_3$ if the beta factor is 0.1?
Solution: Using the beta model, the failure rates for each power
supply are divided into normal and common cause. Using a beta
factor of 0.1:

$$\lambda^C = \beta\lambda = 0.1 \times 25{,}000 = 2{,}500 \text{ FITS}$$

$$\lambda^N = (1-\beta)\lambda = 0.9 \times 25{,}000 = 22{,}500 \text{ FITS}$$

For the beta factor of 0.1, the common-cause failure rate is 2,500
FITS. The normal mode failure rate is 22,500 FITS for each power
supply. The failure rates in the Markov model are $\lambda_1 = 2 \times 22{,}500 =
45{,}000$ FITS, $\lambda_2 = 25{,}000$ FITS, and $\lambda_3 = 2{,}500$ FITS (the
common-cause arc).
The MESH model differs from the beta model in that it assumes that all failures are due to a stress event (all failures
occur when stress exceeds an associated strength; see Chapter 3).
The objective of the MESH model is to calculate the failure rates of one,
two, three, or more failures per stress event. These failure rates are defined
as:
λ(1) = the failure rate where one unit fails per stress event;
λ(2) = the failure rate where two units fail per stress event;
λ(3) = the failure rate where three units fail per stress event;
and so on, up to
λ(n) = the failure rate where n units fail per stress event.
The calculation starts with an estimate of the probability that one, two,
three, etc. units will fail per stress event. These probabilities are:
P1 = The probability that one unit will fail per stress event;
P2 = The probability that two units will fail per stress event;
P3 = The probability that three units will fail per stress event;
and so on.
Pn = The probability that n units will fail per stress event.
Note that the sum of all the probabilities must equal one since every stress
event will fail some quantity of units.
The stress event rate is represented by the Greek letter nu, $\nu$. If only one
unit is exposed, the stress event rate will equal the failure rate, $\nu = \lambda$. If
multiple units (n = the number of units) are exposed to the stress event,
then sometimes more than one unit will fail with each stress event. Under
such conditions, the stress event rate is less than n times the failure rate:

$$\nu < n\lambda \qquad \text{(10-3)}$$

The relationship between stress event rate and individual failure rates is
given by

$$\nu = n\lambda / M \qquad \text{(10-4)}$$
where M is the average number of units failed per stress event. This can be
calculated using the formula for the expected value of a discrete probabil-
ity density function (Equation 2-8).
Once M is calculated, the stress event rate can be calculated using Equa-
tion 10-4. The failure rates for one unit, two units, three units, and so forth
are calculated using:

$$\lambda(1) = P_1\nu; \quad \lambda(2) = P_2\nu; \quad \lambda(3) = P_3\nu; \quad \ldots$$
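A sketch of the MESH calculation under assumed per-event probabilities (the P values below are illustrative estimates, not data from the text):

```python
def mesh_rates(lam, n, p):
    """MESH model: lam = single-unit failure rate, n = units exposed,
    p[k] = probability that k units fail per stress event."""
    assert abs(sum(p.values()) - 1.0) < 1e-9   # probabilities sum to one
    M = sum(k * pk for k, pk in p.items())     # mean units failed/event
    nu = n * lam / M                           # Equation 10-4
    return {k: pk * nu for k, pk in p.items()} # lambda(k) = Pk * nu

# Two exposed units, assuming P1 = 0.9 and P2 = 0.1:
print(mesh_rates(lam=25_000, n=2, p={1: 0.9, 2: 0.1}))  # rates in FITS
```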
EXAMPLE 10-2
[Figure 10-5. Markov Model of Dual Power System with MESH Failure Rates]
EXAMPLE 10-3
EXAMPLE 10-4
Some references multiply the beta number by three when modeling 2oo3
systems [Ref. 6, 7 and 8]. The 2oo3 architecture fails if two of the three
units fail. Therefore, the beta number should be multiplied by three
because there are three combinations of two units exposed to the stress.
Any combination of those three will cause system failure. However, com-
mon-cause failure simulation and further research show that the multi-
plier is 3/2 [Ref. 9]. Examples of how this is used are in Chapter 14 where
2oo3 systems have a 3/2 multiplier on the beta factor.
Common-Cause Avoidance
Control system designers at all levels must recognize that common-cause
failures drastically reduce system safety and reliability in redundant sys-
tems. Therefore these systems must be designed to achieve desired reli-
ability goals even when common-cause failure rates are included in
reliability models. Designers must recognize the failure sources that are
responsible for common-cause failures. Specific solutions must be imple-
mented to combat common-cause failures. The common-cause defense
rules can be grouped into categories that result in three basic rules.
The technique has been tested [Ref. 10 and 11] and has had some success
in hardware and software. However, there are serious tradeoffs and cost
considerations that must be taken into account. The testing has shown
design diversity has not eliminated all common-cause design errors. Sys-
tem-level design errors have not been eliminated. In addition, many new
design faults can be introduced by the diversity itself.
There are significant tradeoffs between the extra effort required to repli-
cate a design more than once (extra design training, extra design docu-
mentation, extra maintenance training, extra spare parts, etc.) and the
effort required to avoid faults during design. The extra complexity created
when connecting diverse machines into a fault tolerant system creates
design faults. Given the new, inevitable problems caused by diversity,
such systems should be approached with caution.
EXAMPLE 10-5
EXAMPLE 10-6
identical in design and manufacture. What is the estimated beta
factor?
EXAMPLE 10-7
In practice, the more complex common-cause models often offer
little benefit. In systems with two or three redundant units, it is not neces-
sary to use multiple parameter common-cause models. The simplest
approach to modeling triple systems is to use the beta model with the 3/2
factor when three units are exposed to the same stress [Ref. 9] and beta
alone when two units are exposed to the same stress.
Occasionally (as with nuclear systems) a fourth unit is added to the design
so that three units can be active even if one unit is removed from service
for maintenance. In systems with four redundant units an accurate model
might use the MESH or the extended beta model for common-cause fail-
ures. When estimating parameters for the MESH model, most agree that
$P_2 > P_3 > P_4$, etc. Most estimates put the ratio of successive factors in a range of
2X to 10X. When using the extended beta model, most estimates put the
$\beta_2/\beta_3$ ratio at a value of 2X to 10X as well. Little hard statistical evidence
has been published to support these estimates, however.
EXAMPLE 10-8
When including the effects of common cause, the block diagram of Figure
10-6 is used. The normal failure rate for a power supply is $\lambda^N = (1-\beta)\lambda = 0.95
\times 0.00005 = 0.0000475$ failures per hour. The common-cause failure rate is $\lambda^C =
\beta\lambda = 0.05 \times 0.00005 = 0.0000025$ failures per hour.
The reliability for the power system including common cause = 0.99785
× 0.99750 = 0.99536.
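These figures can be reproduced directly; the sketch below assumes a 1,000-hour operating interval, an assumption chosen because it yields the reliability values quoted in the example:

```python
import math

lam, beta, t = 0.00005, 0.05, 1000.0   # t = 1000 h is an assumption
lam_n = (1 - beta) * lam               # 0.0000475 per hour
lam_c = beta * lam                     # 0.0000025 per hour

r_single = math.exp(-lam_n * t)        # one supply, normal failures
r_pair = 1 - (1 - r_single) ** 2       # 1oo2 parallel block: 0.99785
r_cc = math.exp(-lam_c * t)            # common-cause series block: 0.99750
print(round(r_pair, 5), round(r_cc, 5), round(r_pair * r_cc, 5))
# 0.99785 0.9975 0.99536
```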
The effects of common cause can be modeled using a fault tree in a man-
ner similar to the application of reliability block diagrams. A fault tree
showing common cause is drawn in Figure 10-7. Again, two power sup-
plies are provided, either of which can operate the system. Without con-
sidering common cause, both must fail in order for the system to fail.
When common cause is considered, it is the equivalent of another failure
that will fail the system. This is shown as another input to an OR gate.
Figure 10-9. Markov Model Showing a Three Device System Without Common Cause
Figure 10-10 shows the same system with possible common-cause failures
included. From state 0, one, two or three devices can fail per stress event.
The arc marked $\lambda_3(1)$ means three devices exposed, one fails. The arc
marked $\lambda_3(2)$ means three devices exposed, two fail. The arc marked $\lambda_3(3)$
represents the case where all three devices fail due to a common cause.
Other arcs are marked with the other possibilities for common-cause fail-
ure. The Markov modeling method is quite flexible when considering
common cause.
Figure 10-10. Markov Model Showing a Three Device System with MESH Common
Cause
Exercises
10.1 List sources of common-cause failures.
10.2 How can one software fault cause the failure of two redundant
controllers?
10.3 Describe the concept of diversity.
10.4 How could you achieve hardware and software diversity in a sys-
tem design?
10.5 Redundant pressure sensors are used to measure pressure in a
high pressure protection system. They are fully isolated electri-
cally. They are mounted several feet apart on the process vessel.
They are identical in design. What is the estimated beta factor?
10.6 A sensor has a dangerous failure rate of 0.00005 failures per hour.
Two of these sensors are used in a 1oo2 safety configuration (the
system will not fail dangerous unless both sensors fail). The beta
factor is estimated to be 0.025. What is the dangerous common-
cause failure rate?
Answers to Exercises
10.1 Design errors, manufacturing errors, maintenance faults, and oper-
ational errors. Another source is environmental stress which
includes electrical stress, mechanical stress, chemical stress (corro-
sive atmospheres, salt air, etc.), physical stress and (in software)
heavy usage.
10.2 A software fault can fail two or more redundant controllers if both
use the same software and if the two controllers are subject to the
same software stress (e.g., identical inputs, identical timing, and
identical machine state). Common-cause software failures may be
reduced with diverse software and/or asynchronous operation
and timing.
10.3 Diversity is the use of different designs in redundant components
(modules, units) in order to reduce susceptibility to a common
stress. Diversity works best when the diverse components respond
differently to a common stress. Diversity does not work well when
the diverse components use basically the same technology such
that they respond the same way to a common stress.
10.4 Hardware diversity can best be achieved when redundant compo-
nents are of different technologies, such as a mechanical switch
and an electrical switch. Some level of diversity is achieved when
programmable electronic circuits are redundant with non-pro-
grammable circuits. Software diversity is accomplished through
independent development of different programs, ideally using
different algorithms, tools, and development teams.
10.6 The dangerous common-cause failure rate is $\lambda^C = \beta\lambda =
0.025 \times 0.00005 = 0.00000125$ failures per hour.
References
1. Gray, J. "A Census of Tandem System Availability Between 1985 and
1990." IEEE Transactions on Reliability, Vol. 39, No. 4, Oct. 1990.
Software Failures
As we saw in Chapter 10, design errors, most often software design errors,
are a major source of common-cause failure. These failures are different
from most other failure types in that all software failures are inadvertently
designed into the system. Software does not wear out. Software is
"manufactured" (copied) with no undetectable duplication errors; there
are no latent manufacturing defects.
This situation leads some to believe that software failures cannot be mod-
eled using statistical techniques. The argument states that a computer pro-
gram always fails given a particular computer execution sequence.
Therefore, reliability (as mathematically defined) equals 0 for that execu-
tion sequence. Reliability equals one for input sequences that do not result
in failure. The system is completely deterministic, not probabilistic.
An industrial operator's console had been operating in the plant for two
years with no problems. A new operator joined the staff and during one of
his first shifts the console stopped updating the display screen and
responding to operator commands shortly after an alarm acknowledg-
ment. The console was powered down and restarted perfectly. There were
no hardware failures. Since the manufacturer had over 500 units in the
field with 12 million total operating hours, it was hard to believe a signifi-
cant software fault existed in such a mature product.
The problem could not be duplicated after many hours of testing so a test
engineer visited the site and interviewed the operator. During this visit it
was observed that "this guy is very fast on the keyboard." That input
allowed the problem to be traced. It was discovered that if an alarm
acknowledgment (Ack) key was struck within 32 milliseconds of the alarm
silence (Sil) key, a software routine would overwrite an area of memory and
cause the computer to crash. For this failure:
In most instances the operation was always successful and the software fault was
masked. Occasionally, the dynamic memory allocation algorithm picked
memory that had not been cleared. The system failure occurred only when
the software module did not append the zero in combination with a mem-
ory allocation in an uncleared area of memory. For this failure:
There are many things in addition to inputs that cause software to fail.
Consider the above cases. Timing of input data was involved. Size of input
data caused a failure. Changing operation (dynamic memory allocation)
contributed. There are more. With such a wide variety of failure sources,
each of which can be treated statistically, there is a solid basis for the sta-
tistical analysis of the reliability and safety of software.
We have all heard the complaints: "The network is down. My computer hung again. I forgot to save my file and just blew
away four hours' work." Our experience is far from perfect. As our soft-
ware dependency increases, our incentive for higher levels of software
reliability is greater.
Intuitively, we may guess that the software failure rate has some relation-
ship to the number of faults (human design errors, "bugs") in the soft-
ware. Fault count is a strength factor. A program with few faults is
stronger than one with many faults. Software strength is also affected by
the amount of stress rejection designed into the software. Software that
checks for valid inputs and rejects invalid inputs will fail much less fre-
quently. Consider the communications example above. Although the soft-
ware did check for correct data frame format, it did not check for correct
data format. If it had, it is likely the failure would not have occurred.
We may also guess that the software failure rate relates to the way in
which the software is used. The usage stress to a software system is the
combination of inputs, timing of inputs and stored data seen by the CPU.
Inputs and the timing of inputs may be a function of other computer sys-
tems, operators, internal hardware, external hardware, or any combination
of these things.
Software Complexity
Why are software failure rates going up? Why does the strength of soft-
ware seem to be decreasing? As mentioned earlier, many think the answer
is complexity that is growing beyond the ability of the tools that help
humans deal with complexity. To understand this growth in complexity, a
view from the microprocessor unit (MPU) is needed. An MPU always
starts the same way. It looks at a particular place in memory and expects
to find a command. If that command is valid, the MPU begins the execu-
tion of a long series of commands from that point, reading inputs and gen-
erating outputs. There are three ways to view this process: as digital states,
as a sequence of digital states called a path, or as a series of inputs.
State Space
The first digital computers were state machines: digital circuits that
moved from state to state depending on the input conditions and memory
contents. A state was represented by a number of binary bits stored on
flip-flop (bistable latch) circuits. One group of flip-flop circuits was called
the program counter. Other groups were called registers.
The machine was successful if it moved from state to state through the
state space as intended by the software engineer. But, if any bits were in
error, the machine entered an unintended state. From there it was
not likely to follow the desired path or perform its intended function. This
constituted a failure.
Path Space
A sequence of states followed by a computer during the execution of a
program is called a path. The collection of all possible paths is called the
path space. A particular path is determined by the contents of memory and
the input received by the computer during the program execution. Simple
MPUs such as those installed in appliances repeatedly execute only a few
paths; general-purpose computers execute many more.
For example, a small program containing loops might generate paths such as:

1234567
121234567
12121234567
1212121234567
123454567
12345454567
1234545454567
121212345454567
However, if we count only a single repetition through each loop, the pro-
gram has three paths: 1234567, 121234567, and 123454567. A path that
includes only a single loop iteration is called a control flow path. Con-
trol flow path structures associated with common program constructs are
shown in Figure 11-3.
Paths are identified and counted in computer programs for many reasons,
including the development of program test strategies. Theoretically, every
path should be tested. If this is done, all software design faults should be
discovered. Test coverage is a measure of the percentage of paths tested.
EXAMPLE 11-1
Problem: How many control flow paths are present in the program
diagrammed in Figure 11-4?
Solution: There are three control flow paths:

12345678,
1212345678, and
1234565678.
All program control flow paths could be tested with three test runs.
Figure 11-4. Program State Diagram
Data-Driven Paths
Control flow path testing does not account for path variations due to input
data. Whenever the number of loop iterations is controlled by input data,
each loop iteration count should be considered a different path. The pro-
gram from Figure 11-4 would have two paths for each possible data value
of time, an input obtained in step 4. For the input value of one, the paths
are 12345678 and 1212345678, because the loop from step 6 to step 5 does
not occur. Table 11-1 lists data-dependent paths for the valid data range of
1 through 5.
Even testing all these 10 paths may not find all software design faults. Cer-
tain errors of omission are not detected until the program is tested with
unexpected inputs. What happens when an input of zero is received by
our toaster program? Assume that the time value is stored in an eight-bit
binary number. The program follows its path to step 5. The number is dec-
remented. A zero is decremented to a binary minus one, represented by
eight binary ones (11111111). This is the same representation as the num-
ber 255. In step 6, the time number will not equal zero. The program will
decrement 255 times! By the time the heater is turned off, the toast will be
ashes and the kitchen will be full of smoke. Most users would consider
this behavior to be product failure (a systematic failure). In order to fully
test this program, two paths would need to be tested for all possible input
numbers. With 256 input numbers, 512 paths would need to be tested.
To prevent this failure, a data range check is added to the program: if the
input time value is invalid, the heater is never turned on and the
untoasted bread is popped up. Invalid numbers are thus rejected. Figure
11-5 shows the modified state diagram. For each valid input, two paths
exist as before (a total of 10). For all inputs above the limit, only two paths
exist, and for all inputs below the limit, only two paths exist. The path space
has been reduced. A total of only 14 paths need to be tested.
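The effect of the range check can be sketched in a few lines (the toaster logic here is an illustration of the state diagram, not production code):

```python
def toast(time_value):
    # Range check: reject invalid inputs before the countdown loop,
    # preventing the 0 -> 255 wraparound failure described above.
    if not 1 <= time_value <= 5:
        return "pop up untoasted"      # invalid input rejected
    heater_on = True                   # step: energize heater
    while time_value > 0:              # safe: value known to be valid
        time_value -= 1
    heater_on = False                  # step: de-energize heater
    return "toast done"

print(toast(3))   # toast done
print(toast(0))   # pop up untoasted, instead of 255 heating cycles
```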
Asynchronous Functions
The path count goes up by orders of magnitude whenever a computer is
designed to perform more than one function in an asynchronous manner.
Most computers implement asynchronous functions with a feature known
as an interrupt. An electronic signal is sent to the computer. The com-
puter is designed in such a way that whenever the signal is received, the
computer stops following its path, saves enough information to return
later to the same spot in the original path, and then starts following a new
path. In effect, every time an interrupt occurs, a new path is created.
Imagine our new product development team has identified a need in the
toaster market. They have estimated thousands of additional units could
be sold if we add a clock to the side of the toaster. In order to keep the cost
down, we add only a digital display and a 1-second timer. The timer will
interrupt the microprocessor. The microprocessor will update the digital
clock display. The state diagram for our enhanced toaster program is
shown in Figure 11-6.
Since the interrupt that causes the computer to go to state B may occur at
any time relative to the path execution, the timer interrupt is asynchro-
nous to the main program. There are many more paths. Consider the case
when the time input equals one. We previously had two possible paths:
12123456789A and 123456789A. If state B can occur between any of the
states of these paths, there are 11 new variations of the first path and 9 of
the second, in addition to the two original paths. Two paths became 22!
The path space has increased by an order of magnitude.
Figure 11-6. Enhanced Program State Diagram with Data Range Checking
Not testing all paths is a problem only if one of those paths results in some
unacceptable operation of the computer. This happens. Software design-
ers often cannot foresee all the possible interactions of these asynchro-
nous tasks. The probability of interaction is increased whenever common
resources, such as memory, are shared by different tasks. Asynchronous
tasks should not be used as a design solution unless necessary. When
asynchronous tasks are necessary, resources such as memory or I/O
devices should not be shared.
Once the computer starts following the path required to receive an instruction from the voice
recognition system, that path cannot be interrupted. To solve this prob-
lem, the computer logic gives the communication interrupt the highest
priority. The communication protocol normally takes less than 1 second to
complete a message. Under error conditions, such as electrical noise, mes-
sages are repeated up to 10 times. Four messages are required for the voice
recognition system to deliver a complete command.
One morning Emily goes into the kitchen and says, "Light toast," expect-
ing a 10-second heat time. She turns on the blender (generating lots of
electrical noise) and pushes down the toast lever. Tyree walks in just then
and says, "Medium toast next." The message from the voice recognition
system interrupts the toaster as its main program is decrementing the
timer. Because of the blender noise, the messages take 40 seconds. The
toast, given 50 seconds of heat instead of 10 seconds, is burned. This sys-
tematic failure occurred because the computer followed an unanticipated,
untested path.
Input Space
So far, we have viewed computer operation in terms of state space and
path space. The computer is successful if it stays in successful states; it is
successful if it follows successful paths. Other states or other paths cause
systematic failure. Dr. J. V. Bukowski explains in Reference 5 that there is a
third way of looking at the problem: Consider the inputs. An input condi-
tion or sequence of input conditions will cause a particular path to be fol-
lowed. Programs accept input from the external world and the computer's
memory. The input space is the collection of all possible input conditions or
sequences of input conditions [Ref. 5].
These models serve a number of useful purposes. They have some poten-
tial to measure expected field failure rates. This information is needed for
accurate system level reliability evaluations. The models also provide
information to prospective buyers. They indicate the level of software
quality. In addition, the models provide information useful in comparing
new software development processes. There are many other uses for the
information.
During software testing, design faults are repaired as soon as they are discovered. Thus, this test
period represents a reliability growth process. The expected failure rate
drops as reliability grows.
Basic Model
Versions of the basic model were developed independently by Jelinski-
Moranda and Shooman and were first published in 1972 [Ref. 11]. In both
versions, the model depends on the measurement of time between fail-
ures. The model assumes there is some quantity of software design faults
at the beginning of the test, and that these faults are independent of each
other. The model assumes all faults are equally likely to cause failure and
an ideal repair process is in place, in which a detected fault is repaired in
negligible time and no new faults are introduced during the repair pro-
cess. It also assumes the failure rate is proportional to the current number
of faults in the program. A constant, k, relates failure rate to the number of
faults that remain in the program. This is illustrated in Figure 11-7.
During the test period, the failure rate as a function of cumulative number
of faults equals
λ(n_c) = k [N₀ − n_c(t)]     (11-2)
N₀ is the number of faults at the beginning of the test period. The cumula-
tive number of faults that have been repaired is given by nc(t). Faults
remaining in the software at any time during the test period are calculated
by subtracting nc(t) from N0. After the test period, faults are not detected
and repaired. The failure rate then remains constant.
EXAMPLE 11-2
λ(n_c) = k [N₀ − n_c(τ)]     (11-3)

The Greek lowercase letter tau (τ) represents execution time. Normally,
for control system computers, execution time equals calendar time, since
the computers are dedicated to the control function.
Proceeding with the model development, it is noted that with both the
basic model and the basic execution time model, it is assumed the detec-
tion of any fault is equally likely. This means in any given time period, a
constant percentage of faults will be detected. To illustrate the concept, we
estimate a program has 1000 faults (5000 lines of source times 0.2 average
faults per line). Assume each week we find a constant 25 percent of the
remaining faults. Table 11-2 lists the results.
--``,,`,,,`,,`,`,,,```,,,``,``,,-`-`,,`,,`,`,,`---
Since the number of faults detected (and repaired) each period is a constant
percentage of the faults remaining, the rate of fault repair is proportional
to the number of faults remaining. The number of faults remaining equals
the total starting number of faults (N₀) minus the cumulative quantity of
repaired faults. Therefore:

dn_c/dτ = k [N₀ − n_c(τ)]     (11-6)

Solving this differential equation gives the cumulative number of repaired
faults as a function of execution time:

n_c(τ) = N₀ [1 − e^(−kτ)]     (11-7)

Substituting Equation 11-7 into Equation 11-3, the failure rate as a function
of execution time is

λ(τ) = k N₀ e^(−kτ)     (11-8)
EXAMPLE 11-3
The parameters used in the curve are N₀ = 140 and k = 0.01. Using
these parameters along with our total execution time of 400 hours in
Equation 11-8, a failure rate is obtained.
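Carrying the arithmetic through: λ(400) = 0.01 × 140 × e^(−0.01 × 400) = 1.4 × e^(−4) ≈ 0.026 failures per hour.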
Figure 11-10. Failure Rate versus Test Time (actual data with fitted curve; x-axis is execution time)
EXAMPLE 11-4
Problem: The boss will not accept our graphic curve fitting technique.
Redo Example 11-3 using numerical linear data regression.
λ(τ) = A e^(Bτ)     (11-9)

then taking the natural log of both sides of the equation results in

ln λ(τ) = ln A + Bτ
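Since ln λ(τ) is linear in τ, ordinary least-squares regression recovers ln A and B. A hypothetical Python sketch (the data pairs below are placeholders, not the book's measurements):

```python
import numpy as np

# Placeholder (tau, lambda) observations; real values would come from
# the recorded failure times of Example 11-3.
tau = np.array([50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
lam = np.array([0.90, 0.55, 0.33, 0.20, 0.12, 0.075])

# Fit ln(lambda) = ln(A) + B*tau (Equation 11-9 in logarithmic form).
B, lnA = np.polyfit(tau, np.log(lam), 1)
A = np.exp(lnA)

# For the BET model, lambda(tau) = k*N0*e^(-k*tau), so k = -B, N0 = A/k.
k, N0 = -B, A / -B
print(f"A = {A:.3f}, B = {B:.5f}, k = {k:.4f}, N0 = {N0:.0f}")
```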
The Logarithmic Poisson (LP) model assumes that failure rate variation as
a function of cumulative repaired faults is an exponential.
λ(n_c) = λ₀ e^(−θ n_c)     (11-10)
For the LP model, the graph indicates that some faults are more likely to
cause failures. When they are removed, the failure rate drops fast. Other
faults are less likely to cause failure. As they are removed, the failure rate
drops more slowly. We should also notice that the rate does not drop sig-
nificantly past a certain point in the fault removal process. This is characteristic of the LP model.
The cumulative number of repaired faults for the LP model is given by the
equation:
n_c(τ) = (1/θ) ln(λ₀ θ τ + 1)     (11-11)
This should be compared with Equation 11-7 for the BET model. The cumula-
tive number of repaired faults for both models is plotted in Figure 11-12.
The expected number of repaired faults tends towards N₀ for the BET
model as execution time increases. For the LP model, however, the
expected number of repaired faults tends toward infinity. This is characteristic
of the LP model.
Figure 11-12. Faults versus Test Time
λ(τ) = λ₀ / (λ₀ θ τ + 1)     (11-12)
The LP model is used in a manner similar to the BET model. Failure times
are recorded during product tests. Parameters are estimated by best-fit
curve fitting. Figure 11-13 shows the data from Table 11-2 fitted with both
a BET curve and an LP curve.
Operational Profile
Earlier in the chapter we discussed the concept of an input space, the
collection of all possible inputs. These inputs cause the computer to execute
different paths through its program. Some failures occur only when cer-
tain inputs are received by the computer. Often, sets of inputs are grouped
according to computer function. Commands that tell the computer to exe-
cute certain functions, followed by input data, represent a logical group-
ing within the input space.
Faults enter software for many reasons: misunderstood functionality,
design error, coding error, etc. These reasons generally result in
independent faults. Occasionally, dependent faults occur, but we have no
strong reason to doubt this assumption.
We conclude then that this assumption is not valid for most testing pro-
cesses. Two effects have been attributed to this assumption violation. First,
the data is generally noisy. The actual data in Figures 11-10 and 11-13,
for example, bounces around quite a bit. This could result from nonran-
dom testing.
New fault introduction does not prevent us from using the model; it
merely changes the parameters. If the new fault introduction rate is simi-
lar to the fault removal rate or larger, the failure rate will not decline. This
is an indication that something more drastic needs to be done. Redesign or
abandonment of the program is in order.
Although the assumption is not met in practice, again the effect is mini-
mal. One such effect is that actual data does not correlate well with the
best-fit curve. When this happens, alternatives exist. We may switch to the
LP model or model each major element of the operational profile as a sep-
arate program.
Model Usage
Several studies have been done using these models, as well as a few oth-
ers, on actual software test data. The results have been moderately suc-
cessful. Usually, one model fits better than others for a particular piece of
software. The characteristics of the software that allow one model or
another to work best have not yet clearly been identified. Some patterns
seem to be forming.
The LP model seems to fit better for larger, more complex programs, espe-
cially when the operational profile is not uniform. The BET model seems
to fit better when program size is changing. Research continues in this
important area.
Exercises
11.1 List several system failures that are caused by software.
11.2 Describe how software failures can be modeled statistically.
11.3 What strength factors can be attributed to a software design?
11.4 What stress factors cause a program to fail?
11.5 Estimate how much software complexity has grown in the last
decade.
11.6 Calculate the number of control flow paths in the program dia-
grammed in Figure 11-15.
11.7 Are all software design faults detected by executing all control
paths?
11.8 How many control flow paths are present in the program dia-
grammed in Figure 11-7?
11.9 We have tested a program for 500 execution hours. A BET model
shows a good correlation using parameters of k = 0.01 and N₀ =
372. If we continue testing and repairing problems for another 200
execution hours, what field failure rate could be expected?
11.10 The program of Exercise 11.9 has a customer-specified maximum
software failure rate of 0.0001 failures/hour. How many hours of
testing and repair will be required to meet the specification?
11.11 We have tested a program for 1000 execution hours. An LP model
shows a good correlation using parameters of λ₀ = 2 and θ = 0.03.
What is the expected field failure rate?
11.12 We have a software reliability goal of 0.0001 failures per hour for
the program of Exercise 11.11. How many test hours are required?
11.13 Is it reasonable to expect the program from Exercise 11.11 to ever
achieve the reliability goal of 0.0001 failures per hour if new faults
are added when old faults are repaired?
Answers to Exercises
11.1 The answer can vary according to experience. Author's list: crash
during word processor use that destroyed chapter 4, PC hung up
during email file transfer, PC hung up during printing with three
applications open, PC crashed after receiving an email message, etc.
11.2 Software failures can be modeled statistically because much like
hardware, the failure sources create stress that can be character-
ized as random variables.
11.3 Software strength is increased when fewer design faults are
present, when stress rejection is added to the software, when
software execution is consistent (less variable: no multitasking,
few or no interrupts, little or no dynamic memory allocation), and
when software diagnostics operate on-line to detect and report
faults.
11.4 Stress on a software program includes inputs (especially unex-
pected inputs), the timing of the inputs, the contents of memory,
and the state of the machine.
11.5 Software complexity appears to have grown over six orders of
magnitude.
References
1. Gray, J. "A Census of Tandem System Availability Between 1985 and
1990." IEEE Transactions on Reliability, Vol. 39, No. 4, Oct. 1990.
14. Musa, J. D., Iannino, A., and Okumoto, K. Software Reliability: Mea-
surement, Prediction, Application. New York: McGraw-Hill, 1987.
Key Issues
The amount of detail to be included in a safety and reliability model
depends on the objectives of the modeling. The amount of effort (and
--``,,`,,,`,,`,`,,,```,,,``,``,,-`-`,,`,,`,`,,`---
cost!) is affected by the level of detail in the model when modeling is done
manually, but it should be noted that for a given level of detail, costs
depend much more on the available computer tools.
Probability Approximations
When the failure rate of a component, module, or unit is known, the prob-
ability of failure for a given time interval is approximated by multiplying
the failure rate times the time interval (see "A Useful Approximation,"
Chapter 4). In safety instrumented systems it is a good practice to periodi-
cally inspect the system for failures. In this situation the time interval used
in the calculation is the inspection period. While this method is an approx-
imation, errors tend to be in the pessimistic direction and therefore the
method is conservative. The failure probabilities for system components
can be used in fault tree diagrams to calculate system failure probabilities.
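As a quick numerical check (a sketch, not part of the text), the approximation λ · TI can be compared with the exact probability 1 − e^(−λ·TI); the approximation is always the larger of the two, so it errs on the conservative side. The rate and interval below are assumed values:

```python
import math

lam = 5.0e-6    # dangerous failure rate, failures/hour (assumed)
TI = 4380.0     # periodic inspection interval, hours (six months)

approx = lam * TI                      # the useful approximation
exact = 1.0 - math.exp(-lam * TI)      # exact probability
print(f"approximate F(TI) = {approx:.6f}")   # 0.021900
print(f"exact       F(TI) = {exact:.6f}")    # 0.021662 (smaller)
```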
[Figure 12-1. Ideal fault tree for the 1oo2 switch system; top event "System fails," F(4380) = 0.00048]

Figure 12-1 shows a simple ideal fault tree for the 1oo2 series wired switch
system. The probability of short circuit (dangerous) failure for one switch
is approximated by multiplying the short circuit failure rate, λ^D, times the
periodic inspection interval, TI (time interval). The system fails short circuit
only if both switches A and B fail short circuit. Therefore, the approximate
probability of the system failing short circuit is given by

PFD = (λ^D TI)²     (12-1)
This simple fault tree assumes that there is no common cause. It assumes
that perfect inspection and perfect repairs are made at each inspection
period. It does not account for more rapid repairs made if diagnostics
detect the failure. The model assumes constant failure rates.
If the two switches had different failure rates (diverse designs) then the
equation would be

PFD = (λ_A^D TI)(λ_B^D TI)     (12-2)
EXAMPLE 12-1
time interval? What is the risk reduction factor (RRF)?
Since both units must fail for the system to fail, system probability of
failure using Equation 12-1 is
A more useful metric averages the probability of failure over the entire
time interval. This metric is called PFDavg.
The approximate PFDavg over the time interval for the 1oo2 system is
given by:
PFDavg(t) = (1/t) ∫₀ᵗ (λ^D t′)² dt′

Substituting t = TI,

PFDavg(TI) = (1/TI) ∫₀^TI (λ^D t′)² dt′

and integrating,

PFDavg(TI) = (1/TI) (λ^D)² [t³/3]₀^TI

which evaluates to

PFDavg = (λ^D)² TI² / 3     (12-3)
NOTE: As will be shown later in this chapter, this simplified equation rep-
resents an ideal situation. This simplified equation should not be used for
safety design verification.
EXAMPLE 12-2
The Markov model for this simple system is shown in Figure 12-2, with
failure arcs 2λ^D (state 0 to 1) and λ^D (state 1 to 2). The P matrix is

P = | 1 − 2λ^D   2λ^D      0    |
    | 0          1 − λ^D   λ^D  |
    | 0          0         1    |
Solving this simple model using a P matrix multiplication technique yields
the PFD as a function of the operating time interval of 4380 hours as
shown in Figure 12-3.
[Figure 12-3. PFD versus operating time interval for the 1oo2 Markov model; the PFD rises from 0 to about 0.0005 over the 4380-hour interval]
Taking the average of these PFD numbers gives a result of 0.0001585. The
Markov model RRF is 6309. This is higher than the value of Example 12-2
(6255). The difference again shows the impact of the approximation in the
simplified equation, an error that is not present in the numeric Markov
solution. However, the differences are not significant in this case. Note:
the differences do become significant for longer time intervals and higher
failure rates.
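A numeric Markov solution like this is easy to reproduce. The sketch below is hypothetical Python; the failure rate λ^D = 5 × 10⁻⁶ per hour is an assumption back-computed from F(4380) = 0.00048 rather than a value stated here:

```python
import numpy as np

lam = 5.0e-6    # dangerous failure rate per switch (assumed)
TI = 4380       # operating time interval, hours

# P matrix for the ideal 1oo2 model: state 0 both OK,
# state 1 one switch failed dangerous, state 2 system failed.
P = np.array([[1 - 2*lam, 2*lam,   0.0],
              [0.0,       1 - lam, lam],
              [0.0,       0.0,     1.0]])

S = np.array([1.0, 0.0, 0.0])     # start in state 0
pfd_sum = 0.0
for _ in range(TI):
    S = S @ P                     # advance one hour
    pfd_sum += S[2]               # PFD = probability of state 2

pfd_avg = pfd_sum / TI
print(f"PFDavg = {pfd_avg:.3e}, RRF = {1/pfd_avg:.0f}")
# Expect PFDavg near 1.6e-4 and RRF near 6300, consistent with the text.
```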
Using the beta model, the dangerous failure rate is divided into normal
and common cause portions:

λ^DN = (1 − β) λ^D     (12-4)

and

λ^DC = β λ^D     (12-5)
A new fault tree is shown in Figure 12-4. The formula for approximate
PFD using this fault tree is

PFD = (λ^DN TI)² + λ^DC TI     (12-6)
[Figure 12-4. 1oo2 fault tree with common cause: the system fails if units A and B both fail or if a common cause failure occurs]
The equation for PFDavg for Figure 12-4 can be derived by integrating the
PFD equation and dividing by the time interval. The average approxima-
tion with common cause is given by:
PFDavg(TI) = (1/TI) ∫₀^TI [ (λ^DN t′)² + λ^DC t′ ] dt′

and integrating,

PFDavg(TI) = (1/TI) [ (λ^DN)² t³/3 + λ^DC t²/2 ]₀^TI

which evaluates to

PFDavg = (λ^DN)² TI²/3 + λ^DC TI/2     (12-7)
EXAMPLE 12-3
Figure 12-5. 1oo2 System Markov Model with Common Cause (failure arcs 2λ^DN, λ^D, and λ^DC)

The Markov model for a 1oo2 system with common cause is shown in Figure
12-5. The P matrix for this model is

P = | 1 − (2λ^DN + λ^DC)   2λ^DN     λ^DC |
    | 0                    1 − λ^D   λ^D  |
    | 0                    0         1    |
[Figure 12-6. PFD, 1oo2 System Markov Model with Common Cause; the PFD rises from 0 to about 0.003 over the 4380-hour operating time interval]
Taking the average of these PFD numbers gives a result of 0.00116. The
RRF is 860. This is higher than the value of Example 12-3 (816).
On-line Diagnostics
As we have seen, system reliability and safety can be substantially
improved when automatic diagnostics are programmed into a system to
detect component, module, or unit failures. This can also benefit the sys-
tem by reducing actual repair time (the time between failure detection
and the completion of repairs). Diagnostics can identify and annunciate
failures. Repairs can be made quickly as the diagnostics indicate exactly
where to look for the problem.
Imagine the two switches used in the 1oo2 example have microcomputers
that periodically open the switch for a few microseconds and check to ver-
ify that the current flowing through the switch begins to drop. With such a
diagnostic it is possible to detect many of the short circuit failure modes in
the switch.
SU - Safe, undetected
SD - Safe, detected
DU - Dangerous, undetected
DD - Dangerous, detected
The appropriate diagnostic coverage factors are used as follows:

λ^SU = (1 − C^S) λ^S     (12-8)
λ^SD = C^S λ^S     (12-9)
λ^DU = (1 − C^D) λ^D     (12-10)

and

λ^DD = C^D λ^D     (12-11)
EXAMPLE 12-4
The system fails when one combination of the two switches fails. The OR
gate in the left side fault tree accounts for this.
An equation can be developed from the fault tree. The probability that a
switch will be failed dangerous if the failure is undetected is approxi-
mated by multiplying the dangerous undetected failure rate times the
inspection interval (DU TI). The probability that a switch will be failed
dangerous when the failure is detected actually depends on maintenance
policy. The switch will be failed dangerous only until it is repaired. The
shorter the repair time, the lower the probability of dangerous failure.
This probability is approximated by multiplying the dangerous detected
failure rate times the repair time (λ^DD RT). The resulting equation for the
fault tree is

PFD = (λ^DU TI)² + 2 λ^DD RT λ^DU TI + (λ^DD RT)²     (12-12)
The equation for PFDavg can be derived by integrating the PFD equation
and dividing by the time interval. This approximation is given by:
PFDavg = (1/TI) ∫₀^TI [ (λ^DU t)² + 2 λ^DD λ^DU RT t + (λ^DD RT)² ] dt

which evaluates to

PFDavg = (1/TI) [ (λ^DU)² t³/3 + λ^DD λ^DU RT t² + (λ^DD RT)² t ]₀^TI

PFDavg = (λ^DU)² TI²/3 + λ^DD λ^DU RT TI + (λ^DD RT)²     (12-13)
EXAMPLE 12-5
Substitute failure rate and repair time data into Equation 12-13 to
obtain PFDavg.
EXAMPLE 12-6
Solution: The first term of Equation 12-13 is

PFDavg = (λ^DU TI)² / 3

Substituting our values, PFDavg = 0.0000004 and RRF = 2,502,033.
The PFDavg represents only 45% of the previous value
(0.00000089), and the RRF is more than double. This result is
dangerous and misleading. The first term of Equation 12-13 is often
published as the Simplified Equation to be used for a 1oo2 system.
The author asserts it should only be used on components with no
diagnostics or a low coverage factor. For a low coverage factor (less
than 50%) this simplification will be more accurate (about 95%).
An on-line repair rate from state 1 to state 0 assumes that the system can
be repaired without a shutdown. In state 2, one switch has failed danger-
ous and the failure is not detected by on-line diagnostics and therefore no
repair rate from this state is shown.
In state 3, the system has failed dangerous with both failures detected by
diagnostics. From this state, the first switch repair takes the model back to
state 1. In state 4, the system has failed dangerous, but the failure is
detected in only one of the switches. Therefore, repair takes the model
back to state 2. The model assumes maintenance policy allows on-line
repair of the system without shutting down the process. State 5 represents
the condition where the system has failed dangerous and the failures are
not detected by diagnostics.
[Figure 12-8 diagram: states 0 (OK), 1 (degraded, one failure detected), 2 (degraded, one failure undetected), 3 (system fail danger, detected), 4 (system fail danger, one detected/one undetected), 5 (system fail danger, undetected); failure arcs 2λ^DD, 2λ^DU, λ^DD, λ^DU and repair arcs μ_O]

Figure 12-8. 1oo2 System Markov Model Accounting for Diagnostics (No Common Cause)
The P matrix for this model is

P = | 1 − 2λ^D   2λ^DD            2λ^DU     0          0         0     |
    | μ_O        1 − (λ^D + μ_O)  0         λ^DD       λ^DU      0     |
    | 0          0                1 − λ^D   0          λ^DD      λ^DU  |
    | 0          2μ_O             0         1 − 2μ_O   0         0     |
    | 0          0                μ_O       0          1 − μ_O   0     |
    | 0          0                0         0          0         1     |
EXAMPLE 12-7
P     0          1          2          3          4          5
0     0.99999    9.5×10⁻⁶   5×10⁻⁷     0          0          0
1     0.013889   0.986106   0          4.75×10⁻⁶  2.5×10⁻⁷   0
2     0          0          0.999995   0          4.75×10⁻⁶  2.5×10⁻⁷
3     0          0.027778   0          0.972222   0          0
4     0          0          0.013889   0          0.986111   0
5     0          0          0          0          0          1
The four failure rates λ^SU, λ^SD, λ^DU, and λ^DD are divided using the beta
model:

λ^SDN = (1 − β) λ^SD     (12-14)
λ^SDC = β λ^SD     (12-15)
λ^SUN = (1 − β) λ^SU     (12-16)
λ^SUC = β λ^SU     (12-17)
λ^DDN = (1 − β) λ^DD     (12-18)
λ^DDC = β λ^DD     (12-19)
λ^DUN = (1 − β) λ^DU     (12-20)
λ^DUC = β λ^DU     (12-21)
EXAMPLE 12-8
λ^SDN = 0.00000405
λ^SDC = 0.00000045
λ^SUN = 0.00000045
λ^SUC = 0.00000005
λ^DDN = 0.000004275
λ^DDC = 0.000000475
λ^DUN = 0.000000225
λ^DUC = 0.000000025
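These splits are mechanical enough to script. In the hypothetical Python sketch below, the inputs λ^S = λ^D = 5 × 10⁻⁶, C^S = 0.9, C^D = 0.95, and β = 0.1 are back-computed from the listed results rather than quoted from the example:

```python
# Split total safe/dangerous rates into the eight categories.
lam_s, lam_d = 5.0e-6, 5.0e-6   # total safe/dangerous rates (assumed)
cs, cd = 0.90, 0.95             # diagnostic coverage factors (assumed)
beta = 0.10                     # common cause beta factor (assumed)

# Equations 12-8 through 12-11: detected/undetected split.
rates = {"SD": cs * lam_s, "SU": (1 - cs) * lam_s,
         "DD": cd * lam_d, "DU": (1 - cd) * lam_d}

# Equations 12-14 through 12-21: normal/common cause split.
for name, lam in rates.items():
    print(f"{name}N = {(1 - beta) * lam:.9f}   {name}C = {beta * lam:.9f}")
```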
Figure 12-9. 1oo2 System Fault Tree - Diagnostics and Common Cause
Figure 12-10. Alternative 1oo2 System Fault Tree - Diagnostics and Common Cause
PFD = (λ^DUN TI)² + 2 λ^DDN RT λ^DUN TI + (λ^DDN RT)² + λ^DDC RT + λ^DUC TI     (12-22)
The equation for PFDavg can be derived by integrating the PFD equation
and dividing by the time interval. This approximated average is given by:
PFDavg = (1/TI) ∫₀^TI [ (λ^DUN t)² + 2 λ^DDN λ^DUN RT t + (λ^DDN RT)² + λ^DDC RT + λ^DUC t ] dt

which evaluates to

PFDavg = (1/TI) [ (λ^DUN)² t³/3 + λ^DDN λ^DUN RT t² + (λ^DDN RT)² t + λ^DDC RT t + λ^DUC t²/2 ]₀^TI

PFDavg = (λ^DUN)² TI²/3 + λ^DDN λ^DUN RT TI + (λ^DDN RT)² + λ^DDC RT + λ^DUC TI/2     (12-23)
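Equation 12-23 is easy to evaluate numerically. The sketch below (hypothetical Python) uses the failure rates of Example 12-8 with an assumed six-month inspection interval and 72-hour repair time; with those assumptions the result is roughly 9 × 10⁻⁵ (RRF near 11,100), consistent with the RRF of 11,151 quoted in the answer to Exercise 12.3:

```python
def pfd_avg_1oo2(dun, ddn, ddc, duc, ti, rt):
    """Equation 12-23: approximate PFDavg for a 1oo2 system with
    on-line diagnostics and common cause."""
    return ((dun * ti) ** 2 / 3.0
            + ddn * dun * rt * ti
            + (ddn * rt) ** 2
            + ddc * rt
            + duc * ti / 2.0)

# Failure rates from Example 12-8; TI and RT are assumptions.
pfd = pfd_avg_1oo2(dun=2.25e-7, ddn=4.275e-6, ddc=4.75e-7,
                   duc=2.5e-8, ti=4380.0, rt=72.0)
print(f"PFDavg = {pfd:.3e}, RRF = {1/pfd:.0f}")
```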
EXAMPLE 12-9
Figure 12-11. 1oo2 System Markov Model - Diagnostics and Common Cause
EXAMPLE 12-10
Problem: Using the Markov model of Figure 12-11 and the failure
rates of Example 12-8, calculate the PFDavg and RRF of the 1oo2
system for a time interval of six months.
Solution: When the failure rates and repair rates are substituted into
the P matrix, the following numeric values result:
P 0 1 2 3 4 5
0 0.9999905 0.00000855 0.00000045 0.000000475 0 0.000000025
1 0.013888889 0.986106111 0 0.00000475 0.00000025 0
2 0 0 0.999995 0 0.00000475 0.00000025
3 0 0.027777778 0 0.972222222 0 0
4 0 0 0.013888889 0 0.986111111 0
5 0 0 0 0 0 1
This fault tree can be analytically evaluated although the equations get
complicated. The approximate probability of dangerous failure for one of
the lower OR gates is:
Figure 12-12. Alternative 1oo2 System Fault Tree - Diagnostics, Common Cause, and Imperfect Proof Test
PFD = (λ^DDN RT + E λ^DUN TI + (1 − E) λ^DUN LT)² + λ^DDC RT + E λ^DUC TI + (1 − E) λ^DUC LT     (12-24)

Multiplying out the squared term of Equation 12-24 expands the equation
into a sum of individual terms, among them:

E λ^DUC TI     (12-26)
(1 − E) λ^DUC LT     (12-27)
(λ^DDN RT)²     (12-28)
(E λ^DUN TI)²     (12-32)

together with the remaining square and cross-product terms.
EXAMPLE 12-11
Problem: Using failure rates from Example 12-8, calculate the PFD for
each term of Equation 12-24 for the fault tree in Figure 12-12 and the
total PFD for a unit lifetime of 10 years. Assume E is 80%.
It is clear that some terms are more significant than others, with the
most significant being those involving imperfect proof test failures
over the device's operating lifetime. The total PFD is 0.00058518.
EXAMPLE 12-12
Problem: Using failure rates from Example 12-8, calculate the PFDavg
and the RRF for the fault tree of Figure 12-12 for a unit lifetime of 10
years. Assume E is 80%.
A Markov model can also include the effect of imperfect proof testing. The
Markov model of Figure 12-14 is similar to that of Figure 12-11. An additional
periodic repair rate from state 5 to state 0 was added. Figure 12-14 also
has an added state, state 6. In this state, dangerous failures are not
detected by either on-line diagnostics or the periodic inspection and test.
[Figure 12-13. Alternative 1oo2 System Fault Tree - Diagnostics, Common Cause, and Imperfect Proof Test: PFD vs. Operating Time, 2 to 10 years, PFD rising to about 0.0006]
Figure 12-14. Markov Model Showing the Impact of Imperfect Proof Testing
EXAMPLE 12-13
Problem: Using the Markov model of Figure 12-14 and the failure
rates of Example 12-8, calculate the PFDavg and RRF of the 1oo2
system for a unit lifetime of 10 years. Assume E is 80%.
Solution: When the failure rates and repair rates are substituted into
the P matrix, the following numeric values result:
P 0 1 2 3 4 5 6
0 0.999991 0.00000855 0.00000045 0.000000475 0 0.00000002 0.000000005
1 0.013889 0.98610611 0 0.00000475 0.00000025 0 0
2 0 0 0.999995 0 0.00000475 0.0000002 0.00000005
3 0 0.02777778 0 0.972222222 0 0 0
4 0 0 0.01388889 0 0.98611111 0 0
5 0 0 0 0 0 1 0
6 0 0 0 0 0 0 1
At the end of the periodic inspection, test, and repair time, the P matrix
is:
P 0 1 2 3 4 5 6
0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 1 0 0 0 0 0 0
6 0 0 0 0 0 0 1
[Figure 12-15. PFD vs. Operating Time for the Markov Model Showing the Impact of Imperfect Proof Testing (2 to 10 years; PFD scale 0 to 0.0006)]
Diagnostic Failures
What happens when the subsystem performing the on-line diagnostics
fails? Most of the time, this does not immediately impact the safety func-
tion, but it does cause loss of the automatic diagnostics that detect poten-
tially dangerous failures. This secondary failure may have a significant
impact on PFD and PFDavg depending on the quality of the diagnostics.
Systems with a very good diagnostic coverage factor will suffer the most
when a diagnostic failure occurs (Ref. 2).
Figure 12-17 shows the impact of diagnostic failure. A new state (state 1) is
added to the model. In state 1, the diagnostics have failed. It can be seen
that the worst-case impact is that all failures previously classified as dan-
gerous detected are no longer detected and therefore go to state 3, danger-
ous undetected failure. This level of modeling is needed whenever the
diagnostic coverage is in the 95% plus range to be sure the results are not
optimistic (Ref. 3).
Figure 12-17. Markov Model Single Switch with Failure of Automatic Diagnostics
[Fault tree: the system fails dangerous if A and B have an initial common cause failure, fail due to common stress (λ^DDC, E λ^DUC, or (1 − E) λ^DUC), or if switches A and B each fail (λ^DDN, E λ^DUN detected by proof test, (1 − E) λ^DUN not detected, or an initial PFD)]
In Closing
It can be seen that on-line diagnostics have a major impact on the accuracy
and validity of modeling results. Building a model that ignores on-line
diagnostics can be unnecessarily pessimistic. Common cause has a major
impact, as does imperfect proof testing and repair. A model that does not
consider common cause and imperfect proof testing and repair will pro-
duce overly optimistic results.
Exercises
12.1 Example 12-9 uses a fault tree that accounts for both common
cause and on-line diagnostic capability. Repeat that example using
a smaller beta factor of 0.02.
12.2 Compare the result of Exercise 12.1 to the result of Example 12-3.
What is the % difference?
12.3 Repeat Example 12-9 for a periodic inspection interval of one year
(8760 hours).
12.4 Why is the RRF result of Exercise 12.6 so much higher than the
others?
Answers to Exercises
12.1 Lowering the beta to 0.02 improves the RRF to 53,631.
12.2 The RRF from Example 12-1 was 6255. The RRF from Example 12-3
was 816. Example 12-3 RRF was about 13% of that of Example 12-1.
This shows almost an order of magnitude difference in the result.
12.3 The RRF dropped from 11,151 to 6,864.
12.4 Exercise 12.6 used a simplified equation that does not consider
PFD due to repair time, and it does not consider common cause or
imperfect proof testing.
References
1. Bukowski, J. V. Modeling and Analyzing the Effects of Periodic
Inspection on the Performance of Safety-Critical Systems. IEEE
Transactions on Reliability (Vol. 50, No. 3). New York: IEEE,
Sep. 2001.
Safety Model
Construction
System Model Development
A successful reliability and safety evaluation of a system depends, in large
part, on the process used to define the model for the system. Although not
sufficient by itself, a knowledge of proper system operation is essential.
Perhaps more important is an understanding of system operation under
failure conditions. One of the best tools to systematically gain such an
understanding is Failure Modes and Effects Analysis (FMEA) (Chapter 5).
A series of steps, including an FMEA, can be taken to help ensure the con-
struction of an accurate reliability and safety model. The following steps
are recommended:
If the system fails such that the pressure switch will not open, both con-
troller outputs are always energized or the valve is jammed closed, it is
called a dangerous failure because the safety system cannot relieve the
pressure. If the system fails in such a way that the pressure switch fails
open, either controller output is failed de-energized or the valve fails
open, it is called a safe failure, since pressure is inadvertently relieved.
The FMEA chart is presented in Table 13-1. Each system level component
is listed along with its failure modes. The system effect for each failure is
listed and categorized. The failure rates are typically obtained from the
component manufacturers. These are given in units of FITS (failures per
10⁹ hours).
Table 13-1. System FMEA (failure rates in FITS)

Component           Failure Mode     Cause                  System Effect              Category    Rate   Comment
Valve VALVE1        jam closed       corrosion, dirt        cannot trip                dangerous   100
                    coil open        elec. surge            false trip                 safe        50     if coil opens, valve opens
                    coil short       corrosion, wire        false trip                 safe        50     if coil shorts, valve opens
Pres. switch PSH1   sense short      power surge,           output energize -          dangerous   100
                                     overpressure           cannot trip
                    open             many                   false trip                 safe        400
                    ground fault     corrosion              cannot trip                dangerous   100    assume grounding of positive side
Controller 1 C1     no comlink       many                   no effect                  ---         145    no effect on safety function
(logic solver)      output energize  surge, heat            cannot trip                dangerous   230
                    output open      many                   false trip                 safe        950
Controller 2 C2     no comlink       many                   no effect                  ---         145    no effect on safety function
(logic solver)      output energize  surge, heat            cannot trip                dangerous   230
                    output open      many                   false trip                 safe        950
Starting with the two primary failure categories, all FMEA failure rates
can be placed into one of the two categories using a Venn diagram (Figure
13-2). For example, a short circuit failure of the pressure switch belongs in
the dangerous (valve closed) category. The total failure rate is divided:
λ^TOT = λ^S + λ^D     (13-1)

where the superscript S represents a safe failure and the superscript D
represents a dangerous failure.
EXAMPLE 13-1
Problem: Divide the failure rates of Table 13-1 into the two
categories, safe and dangerous, based on the FMEA table.
Solution: For each component, add the failure rates for each mode:
λ^S_C = 950 × 10⁻⁹ failures per hour

λ^D_C = 230 × 10⁻⁹ failures per hour
must also divide the dangerous failures into two groups, dangerous com-
mon-cause failures (DC) and dangerous normal failures (DN). The failure
rates are divided into two mutually exclusive groups where:
λ^S = λ^SC + λ^SN     (13-2)

and

λ^D = λ^DC + λ^DN     (13-3)
The beta factor is based on the chances of a common stress failing multiple
redundant components. The factors to be considered are physical location,
electrical separation, inherent strength of the components versus the envi-
ronment, and any diversity in the redundant components. Though there
may be different beta factors for safe failures and dangerous failures, the
considerations are the same; therefore, the same β is typically used.
The failure rates must be further classified into those that are detected by
on-line diagnostics and those that are not. Detected failures are classified
as Detected. Those not detected by on-line diagnostics are classified as
Undetected. Both safe and dangerous normal failures are classified. Safe
and dangerous common-cause failures are also classified.
EXAMPLE 13-2
λ^SN_C = 0.9 (950) = 855
λ^SC_C = 0.1 (950) = 95
λ^DN_C = 0.9 (230) = 207
λ^DC_C = 0.1 (230) = 23
from the undetected failures. The eight failure rate categories (Chapter 12)
are calculated as follows:
λ^SDN = C^S λ^SN     (13-4)
λ^SUN = (1 − C^S) λ^SN     (13-5)
λ^DDN = C^D λ^DN     (13-6)
λ^DUN = (1 − C^D) λ^DN     (13-7)
λ^SDC = C^S λ^SC     (13-8)
λ^SUC = (1 − C^S) λ^SC     (13-9)
λ^DDC = C^D λ^DC     (13-10)
λ^DUC = (1 − C^D) λ^DC     (13-11)
EXAMPLE 13-3
Fractional values are rounded up (X becomes X + 1), as the normal rules
of rounding do not seem pessimistic enough.
As a check, add up the failure rate categories for each component. The
total (with allowance as appropriate for rounding) should equal the start-
ing number. For a controller, the total equals 1180. All totals match.
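The check is easy to automate. The hypothetical Python sketch below splits the controller's rates into the eight categories and confirms they sum back to the 1180 FITS starting total; the coverage factors are placeholders, since any values must preserve the total:

```python
def eight_categories(lam_s, lam_d, beta, cs, cd):
    """Equations 13-2/13-3 followed by 13-4 through 13-11."""
    cats = {}
    for mode, lam, cov in (("S", lam_s, cs), ("D", lam_d, cd)):
        normal, common = (1 - beta) * lam, beta * lam
        cats[mode + "DN"] = cov * normal
        cats[mode + "UN"] = (1 - cov) * normal
        cats[mode + "DC"] = cov * common
        cats[mode + "UC"] = (1 - cov) * common
    return cats

# Controller rates from Table 13-1 (FITS), beta = 0.1 per Example 13-2;
# the coverage factors here are illustrative assumptions.
cats = eight_categories(lam_s=950, lam_d=230, beta=0.1, cs=0.9, cd=0.9)
print(sum(cats.values()))   # 1180.0 -- matches the starting total
```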
The first step in fault tree construction for the dangerous failure mode is
shown in Figure 13-3. Three major events will cause the system to fail dan-
gerous: a dangerous switch failure, a dangerous failure in both controllers,
and a dangerous valve failure.
The process continues using the failure rate checklist. A switch can fail
dangerous if it fails dangerous detected or if it fails dangerous undetected.
An OR gate is added to the fault tree showing these events. The only dan-
gerous valve failure is dangerous undetected. A basic fault showing this
condition is added to the fault tree.
There are a number of ways in which both controllers can fail dangerous.
These include a detected common cause failure or an undetected common
cause failure. Both controllers can fail dangerous if any of four combina-
tions of dangerous detected and dangerous undetected occur. These are
also added to the fault tree. The complete model is shown in Figure 13-4.
The arc drawn from state 0 to the new failure state is labeled with the symbol
λ1. This failure rate includes all of the safe failure rates circled in Figure 13-6.
Four failure rates cause new states. These are circled in Figure 13-8. When
the additional system success states are added to the Markov model, the
interim diagram looks like Figure 13-9. Since states 2 and 3 represent fail-
ures detected by on-line diagnostics, an on-line repair rate (made without
shutting down, assuming that maintenance policy allows) goes from these
states to state 0.
λ2 = λ^DDN_C1     (13-13)
λ3 = λ^DDN_C2     (13-14)
λ4 = λ^DUN_C1     (13-15)
λ5 = λ^DUN_C2     (13-16)
The remaining failure rates from Figure 13-5 cause the system to fail with
the valve closed. These are circled in Figure 13-10. To show this system
behavior, two more failure states are added to the model. State 6 repre-
sents a condition where the system has failed with the valve closed (dan-
gerous) but the failure is detected. In state 7, the system has failed
dangerous and the failure is not detected. Assuming maintenance policy
allows it, an on-line repair can be made from state 6. This is indicated with
a repair arc from state 6 to state 0. The new states are shown in Figure 13-11,
where:

λ6 = λ^DDC_C + λ^DD_PSH1     (13-17)
λ7 = λ^DUC_C + λ^DU_PSH1 + λ^DU_VALVE1     (13-18)
A check of the failure rates originally listed in Figure 13-5 will show that
all failure rate categories have been included in the Markov model.
The Markov model continues by remembering the rule: For each success-
ful system state, list all failure rate categories for all successful compo-
nents. The interim model now has four successful system states that must
be considered. Construction continues from state 2. In this state the system
has one controller, the pressure switch, and the valve working success-
fully. State 2 failure rates are circled in Figure 13-12.
An examination shows that these failures will send the system to either
state 1 (the valve open failure state), state 6 (the valve closed with all fail-
ures detected state), or new system failure states where some failures are
detected and some are not. These will be modeled with two new states in
order to show the effects of on-line repair. Four arcs are added from state 2
as shown in Figure 13-13, where:
λ8 = λ^SU_PSH1 + λ^SD_C + λ^SU_C + λ^SD_VALVE1 + λ^SU_VALVE1     (13-19)
λ9 = λ^DD_PSH1 + λ^DD_C     (13-20)
λ10 = λ^DU_PSH1 + λ^DU_VALVE1     (13-21)
λ11 = λ^DU_C     (13-22)
The model is still not complete. States 4 and 5 are system success states
where additional failures will fail the system. Figure 13-16 shows the fail-
ure rates in state 4. Any safe failure rate will take the Markov model to
state 1. That group of failure rates is the same group as in states 2 and 3.
Any dangerous detected failure rate will take the system to a new failure
state where there is a combination of detected and undetected component
failures. A repair from that new state will return the system to state 4. Any
dangerous undetected failure will take the system to state 7 where all
dangerous failures are undetected. These are labeled:

λ12 = λ^DU_PSH1 + λ^DU_VALVE1 + λ^DU_C     (13-23)
State 5 has the same failure and repair rates. Figure 13-17 shows the arcs
from states 4 and 5. Figure 13-18 shows these additions to the completed
Markov model. The system is successful in states 0, 2, 3, 4, and 5. The sys-
tem has failed safe in state 1. The system has failed dangerous in states 6,
7, 8, 9, 10, 11, 12, and 13.
Note that the repair arcs from dangerous system failure states where com-
ponent failures are detected (states 6, 8, 9, 10, 11, 12, and 13) are valid only
if repairs are made to the system without shutting it down. In some com-
panies, operators are instructed to shut down the process if a dangerous
detected failure occurs. The model could be changed to show this by
replacing the current repair arc with an arc from those states to state 1
with the average shutdown rate (1/average shutdown time).
Figure 13-19. First Simplification of Markov Model
Further state merging is possible. Both state 2 and state 3 have identical
transition rates to state 0. These two states also have the same exit rate to
states 1, 6, and 7. There are no more exit rates to check; these two states can
be merged. A similar situation exists for state 4 and state 5. These also can
be merged. A repair rate may be added from state 1 to state 0 if a process is
restarted after a shutdown. The simplified Markov model is redrawn in
Figure 13-20.
The P matrix for the simplified model is

P = | 1 − Σᵢ₌₁⁷ λᵢ   λ1        λ2 + λ3                          λ4 + λ5              λ6               λ7  |
    | μ_SD          1 − μ_SD   0                                0                    0                0   |
    | μ_O           λ8         1 − (μ_O + λ8 + λ9 + λ10 + λ11)  0                    λ9 + λ10 + λ11   0   |
    | 0             λ8         0                                1 − (λ8 + λ9 + λ12)  λ9               λ12 |
    | μ_O           0          0                                0                    1 − μ_O          0   |
    | 0             0          0                                0                    0                1   |
From this matrix many reliability and safety measures can be calculated,
including MTTF, MTTFS, MTTFD, time-dependent PFD, PFDavg,
availability, PFS, and many others. All calculations can be done on a
personal computer spreadsheet.
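One such measure, sketched below in hypothetical Python rather than a spreadsheet, is MTTF computed from the transition matrix: restrict P to the success states, form the fundamental matrix N = (I − Q)⁻¹, and sum the starting state's row (each step of P represents one hour):

```python
import numpy as np

def mttf_hours(P, success_states, start=0):
    """MTTF from a discrete-time Markov transition matrix with a
    one-hour time step. N = (I - Q)^-1 gives the expected number of
    hours spent in each success state before absorption."""
    Q = P[np.ix_(success_states, success_states)]
    N = np.linalg.inv(np.eye(len(success_states)) - Q)
    return N[start].sum()

# Demonstration on the ideal 1oo2 model with an assumed failure rate.
lam = 5.0e-6
P = np.array([[1 - 2*lam, 2*lam,   0.0],
              [0.0,       1 - lam, lam],
              [0.0,       0.0,     1.0]])
print(f"MTTF = {mttf_hours(P, [0, 1]):,.0f} hours")  # 3/(2*lam) = 300,000
```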
Exercises
13.1 Why is a system level FMEA necessary to accurately model control
system reliability?
13.2 List degraded modes of operation that are valid for a control sys-
tem. Are these modes of operation considered failures?
13.3 A control system has no on-line diagnostics and no parallel (redun-
dant) components. The control system can fail only in a de-ener-
gized condition. How many failure rate categories exist?
13.4 Under what circumstances are common-cause failures distin-
guished from normal failures?
13.5 When must detected versus undetected failures be modeled differ-
ently?
Answers to Exercises
13.1 An FMEA (or equivalent) is necessary to understand how system
components fail and how those failures affect the system.
13.2 Many different degraded modes of operation exist. In some cases,
the response time of the system slows. This should be considered a
failure only if the system response time becomes slower than the
requirement. In other cases, an automatic diagnostic function fails.
This has an impact on reliability and safety and can be modeled
but it is typically not a failure per the stated requirements. If one
considers the necessary performance of the system as described by
the requirements, other degraded modes of operation may become
clear.
13.3 One failure mode, no diagnostics, and no common cause: one fail-
ure rate should be modeled.
13.4 Common cause failures must be modeled when redundant compo-
nents exist in a system.
13.5 It is important to distinguish detected failures from undetected
failures when the system takes a different action with these fail-
ures; for example, when detected failures de-energize a switch and
undetected failures do not. It is also important to distinguish these
failures when a different repair situation will exist, which is usu-
ally the case.
References
1. Shooman, M. L., and Laemmel, A. E. "Simplification of Markov
Models by State Merging." 1987 Proceedings of the Annual Reliability
and Maintainability Symposium. New York: IEEE, 1987.
System Architectures

Introduction
There are many ways in which to arrange control system components
when building a system. Some arrangements have been designed to maxi-
mize the probability of successful operation (reliability or availability).
Some arrangements have been designed to minimize the probability of
failure with outputs energized. Some arrangements have been designed to
minimize the probability of failure with outputs de-energized. Other
arrangements have been designed to protect against other specific failure
modes.
Four of the architectures, 1oo1, 1oo2, 2oo2 and 2oo3, have existed since the
early days of relay logic. The architectures with the D designation have
one or more output switches controlled by automatic diagnostics. These
diagnostics are used to control system failure modes and to modify the
failure behavior of units within the system. These architectures were
developed starting in the 1980s when microcomputer systems had enough
computing power to perform good automatic diagnostics.
EXAMPLE 14-1
What are the failure rate categories for the single board controller?
What are the diagnostic coverage factors for both safe and
dangerous failures?
Solution: The unit has eight input circuits, four output circuits, and
one common set of circuitry. It is assumed that this is a series system
(the failure of any circuit will cause failure of the entire unit) so the
failure rates may be added per Equation 7-6. For the single board
controller (SBC):
Conventional                  λ^SD   λ^SU   λ^DD   λ^DU   Total
Conv. In                      0      70     0      24     94
Conv. Out                     26     27     4      34     91
Conv. MPU                     144    203    289    272    908
Total (8 In + 4 Out + MPU)    248    871    305    600    2024
The diagnostic coverage factor for safe failures is the detected safe
failure rate (SD) divided by the total safe failure rate.
CS = 248 / (248 + 871) = 22.2%
The diagnostic coverage factor for dangerous failures is the detected
dangerous failure rate (DD) divided by the total dangerous failure
rate.
CD = 305 / (305 + 600) = 33.7%
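The totals and coverage factors follow mechanically from the per-circuit rates. A hypothetical Python sketch of the bookkeeping in Example 14-1:

```python
# Per-circuit FMEDA failure rates in FITS: (SD, SU, DD, DU).
circuits = {"in":  (0, 70, 0, 24),        # one input circuit
            "out": (26, 27, 4, 34),       # one output circuit
            "mpu": (144, 203, 289, 272)}  # common circuitry
counts = {"in": 8, "out": 4, "mpu": 1}    # circuits per unit

# Series system: unit rate is the sum of circuit rates (Equation 7-6).
totals = [sum(counts[c] * circuits[c][i] for c in circuits)
          for i in range(4)]
sd, su, dd, du = totals
print(f"SD={sd} SU={su} DD={dd} DU={du} total={sum(totals)}")

cs = sd / (sd + su)   # safe coverage: detected safe over total safe
cd = dd / (dd + du)   # dangerous coverage
print(f"CS = {cs:.1%}, CD = {cd:.1%}")   # 22.2% and 33.7%
```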
EXAMPLE 14-2
What are the failure rate categories for the single safety controller?
What are the diagnostic coverage factors for both safe and
dangerous failures?
Solution: The unit has eight input circuits, four output circuits, and
one common set of circuitry. It is assumed that this is a series system
so the failure rates may be added per Equation 7-6. For this SSC:
The diagnostic coverage factor for safe failures is the detected safe
failure rate (SD) divided by the total safe failure rate.
EXAMPLE 14-3
Solution: The four failure rate categories are divided using Equations
12-14 through 12-21.
EXAMPLE 14-4
Solution: The four failure rate categories are divided using Equations
12-14 through 12-21.
16.7 failures per billion hours
System Configurations
A number of configurations exist in real implementations of control sys-
tems. Some simplified configurations are representative. For these simpli-
fied configurations, fault trees and Markov models are developed.
The models will account for on-line diagnostic capability and common
cause. The Markov models will be solved using time dependent solutions
for a one year mission time. It is assumed that no manual inspection is
done during this year. Therefore, the important variable of manual proof
test coverage is not modeled. This approach shows the architectural differ-
ences more clearly but does not reflect the reality of industrial operations,
where mission times are much longer, with periodic manual proof testing
done during the mission. Models for those situations should include proof
test coverage as explained in Chapter 12.
1oo1Single Unit
The single controller with single microprocessing unit (MPU) and single
I/O (Figure 14-2) represents a minimum system. No fault tolerance is pro-
vided by this system. No failure mode protection is provided. The elec-
tronic circuits can fail safe (outputs de-energized, open circuit) or
dangerous (outputs frozen or energized, short circuit). Since the effects of
on-line diagnostics should be modeled, four failure categories are
included: DD (dangerous detected), DU (dangerous undetected), SD (safe
detected), and SU (safe undetected).
A fault tree for the probability of dangerous failure (sometimes called
probability of failure on demand, PFD), assuming perfect repair, gives

PFD_1oo1 = λ^DD RT + λ^DU MT     (14-1)

Integrating over the mission time period and dividing by MT gives the
average:

PFDavg_1oo1 = λ^DD RT + λ^DU MT/2     (14-2)
EXAMPLE 14-5
EXAMPLE 14-6
Figure 14-4. 1oo1 PFS Fault Tree

PFS_1oo1 = λ^SD SD + λ^SU SD     (14-3)

where SD in the time terms is the average startup time after a shutdown.
EXAMPLE 14-7
EXAMPLE 14-8
From the OK state, the 1oo1 system can reach three other states. State 1
represents the fail-safe condition. In this
state, the controller has failed with its outputs de-energized. State 2 repre-
sents the fail-danger condition with a detected failure. In this state, the
controller has failed with its outputs energized but the failure is detected
by on-line diagnostics and can be repaired. The 1oo1 system has also failed
dangerous in state 3 but the failure is not detected by on-line diagnostics.
Note that state 2 in a 1oo1 architecture can only exist if no automatic shut-
down mechanism has been added to the system. Although partial auto-
matic shutdown can be achieved in a 1oo1 architecture, a second switch is
required to guarantee an automatic shutdown for all dangerous failures
(for example, the output switch). Therefore, a single architecture with auto-
matic shutdown is different and was first named 1oo1D in an earlier edi-
tion of this book [Ref. 1].
EXAMPLE 14-9
Problem: Calculate the PFS and PFD from the Markov model of a
1oo1 single safety controller using failure rates from Example 14-2.
Average repair time is 72 hours. Average startup time after a
shutdown is 96 hours. The mission time interval is one year.
Solution: The failure rate and repair rate numbers are substituted
into a P matrix. The numeric matrix is:
The PFS at 8760 hours is 0.00015216. The PFD is the sum of state 2
and state 3 probabilities. At 8760 hours the PFD is 0.00021766. The
PFDavg is calculated by direct numerical average of the time
dependent PFD results. The PFDavg is 0.00013833.
The PFD for the single board controller in a 1oo1 architecture might be
In a 1oo2 architecture, both units must fail dangerous for the system to
fail dangerous. The 1oo2 configuration typically utilizes two independent
main processors, each with its own independent I/O (see Figure 14-7).
The system offers a low probability of failure on demand, but it increases
the probability of a fail-safe failure.
Figure 14-7. 1oo2 Architecture
PFD = (λ^DUN MT)² + 2 λ^DDN RT λ^DUN MT + (λ^DDN RT)² + λ^DDC RT + λ^DUC MT     (14-4)

The equation for the PFDavg can be derived by integrating the PFD equation
and dividing by the time interval. This average approximation is given by:
The equation for the PFDavg can be derived by integrating the PFD equa-
tion and dividing by the time interval. This average approximation is
given by:
PFDavg = (1/MT) ∫₀^MT [ (λ^DUN t)² + 2 λ^DDN λ^DUN RT t + (λ^DDN RT)² + λ^DDC RT + λ^DUC t ] dt

which evaluates to

PFDavg = (1/MT) [ (λ^DUN)² t³/3 + λ^DDN λ^DUN RT t² + (λ^DDN RT)² t + λ^DDC RT t + λ^DUC t²/2 ]₀^MT

PFDavg = (λ^DUN)² MT²/3 + λ^DDN λ^DUN RT MT + (λ^DDN RT)² + λ^DDC RT + λ^DUC MT/2     (14-5)
It should be noted that Equation 12-23 is the same as Equation 14-5 except
for the MT notation instead of the TI notation. The reader may compare
the equations to see how the level of model detail impacts the equation.
EXAMPLE 14-10
EXAMPLE 14-11
EXAMPLE 14-12
EXAMPLE 14-13
The on-line repair rate from state 4 to state 0 assumes that the repair tech-
nician will inspect the system and repair all failures when making a ser-
vice call. If that assumption is not valid, state 4 must be split into two
states, one with both controllers failed detected and the other with one
detected and one undetected. The state with both detected will repair to
state 0. The state with only one detected will repair to state 2. The assump-
--``,,`,,,`,,`,`,,,```,,,``,``,,-`-`,,`,,`,`,,`---
tion made for this model does simplify the model and has no significant
impact on the fail-danger probability unless coverage factors are low.
(repair and absorbing rows of the P matrix:)

| μ_SD   0   0   1 − μ_SD   0         0 |
| μ_O    0   0   0          1 − μ_O   0 |
| 0      0   0   0          0         1 |
Numeric solutions for the PFD, PFS, MTTF (mean time to failure) and
other reliability metrics can be obtained from this matrix using a spread-
sheet.
EXAMPLE 14-14
Solution: First the failure rate data is substituted into the transition
matrix:
EXAMPLE 14-15
EXAMPLE 14-16
[Fault tree: System Fails Safely; unit A fails (λ^SN) AND unit B fails (λ^SN)]
Ignoring common cause, an open circuit failure in both A and B must occur
for the system to fail safe. The first order approximation equation to solve
for PFS is
EXAMPLE 14-17
Problem: Two single board controllers are used in a 2oo2
architecture system. The failure rates are obtained from the FMEDA
of a conventional micro PLC (Example 14-3). The average repair time
is 72 hours. Mission time is one year (8760 hours). The average time
to restart the process after a shutdown is 96 hours. What is the
approximate PFS?
Solution: Using Equation 14-8, PFS2oo2 = 0.000058.
EXAMPLE 14-18
Problem: Two single safety controllers are wired into a 2oo2
architecture system. The failure rates are obtained from an FMEDA
(Example 14-4). The average repair time is 72 hours. Mission time is
one year (8760 hours). The average time to restart the process after
a shutdown is 96 hours. What is the approximate PFS?
Solution: Using Equation 14-8, PFS2oo2 = 0.00000336.
(repair and absorbing rows of the P matrix:)

| μ_SD   0   0   1 − μ_SD   0         0 |
| μ_O    0   0   0          1 − μ_O   0 |
| 0      0   0   0          0         1 |
Numeric solutions for the various reliability and safety metrics are practi-
cal and precise.
EXAMPLE 14-19
The PFS and PFD are calculated by multiplying the P matrix by a row
matrix S starting with the system in state 0. When this is done, the
PFD at 8760 hours = 0.00043076 and the PFS at 8760 hours =
0.000000321. Note that the PFS is very low. This is expected for a
2oo2 architecture.
Figure 14-15. 1oo1D Architecture
EXAMPLE 14-20
λ^SD = 0 FITS,
λ^SU = 30 FITS, and
λ^AU = 20 FITS.
Solution:
EXAMPLE 14-21
It is clear that the first term of the equation is most significant. This is
true until the diagnostic coverage gets above 98%. Then the second
term becomes significant and should not be ignored.
EXAMPLE 14-22
P = | 1 − (λ^SD + λ^SU + λ^DD + λ^DU + λ^AU)   λ^AU                               λ^SD + λ^SU + λ^DD   λ^DU        |
    | 0                                        1 − (λ^SD + λ^SU + λ^DD + λ^DU)    λ^SD + λ^SU          λ^DD + λ^DU |
    | μ_SD                                     0                                  1 − μ_SD             0           |
    | 0                                        0                                  0                    1           |
EXAMPLE 14-23
Two outputs from each controller unit are required for each output unit.
The two outputs from the three controllers are wired in a voting circuit,
which determines the actual output (Figure 14-20). The output will equal
the majority. When two sets of outputs conduct, the load is energized.
When two sets of outputs do not conduct, the load is de-energized.
A closer examination of the voting circuit shows that it will tolerate a fail-
ure in either failure modedangerous (short circuit) or safe (open circuit).
Figure 14-21 shows that when one unit fails open circuit, the system effec-
tively degrades to a 1oo2 configuration. If one unit fails short circuit the
system effectively degrades to a 2oo2 configuration. In both cases, the sys-
tem remains in successful operation.
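This degradation behavior follows directly from majority logic. A toy sketch in hypothetical Python (True meaning a unit's output conducts) verifies both cases:

```python
def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Load is energized when a majority of units conduct."""
    return (a and b) or (a and c) or (b and c)

# Unit A stuck open (never conducts): both B and C must conduct to
# keep the load energized -- the system behaves like 1oo2.
assert all(vote_2oo3(False, b, c) == (b and c)
           for b in (False, True) for c in (False, True))

# Unit A stuck short (always conducts): either B or C keeps the load
# energized -- the system behaves like 2oo2.
assert all(vote_2oo3(True, b, c) == (b or c)
           for b in (False, True) for c in (False, True))
print("degradation behavior verified")
```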
The 2oo3 system fails dangerous when any leg of the voting circuit fails
short circuit: the AB leg, the AC leg, or the BC leg. These are shown in the
top level events of the PFD fault tree of Figure 14-23. Each leg consists of
two switches wired in series like a
1oo2 configuration. The subtree for each leg is developed for the 1oo2 con-
figuration and each looks like Figure 14-8, the 1oo2 PFD fault tree. It
should be noted that the system will also fail dangerous if all three legs fail
dangerous. This can happen due to common cause or a combination of
three independent failures. Since this is a second order effect, it can be
assumed to be negligible for first order approximation purposes. This is
indicated in the fault tree with the incomplete event symbol. An approx-
imation equation for PFD can be derived from the fault tree. The first
order approximate equation for the PFD is:
Note that a factor of 3/2 was used to scale the beta model for common
cause for a three unit system. This was explained in Chapter 10.
EXAMPLE 14-24
EXAMPLE 14-25
dangerous failures except that the failure modes are different. This is the
result of the symmetrical nature of the architecture. Note that each major
event in the top level of the fault tree is equivalent to the 2oo2 fault tree of
Figure 14-13. The approximate equation for the PFS derived from this fault
tree is:
EXAMPLE 14-26
EXAMPLE 14-27
From states 1 and 2 the system is operating. Further safe failures will fail
the system with outputs de-energized (safe). Further dangerous failures
will lead to secondary degradation. From states 3 and 4 the system is oper-
ating in 2oo2 mode. Additional dangerous failures will fail the system
with outputs energized. Further safe failures degrade the system again.
Repair rates are added to the diagram. It is assumed that the system is
inspected and that all failed controllers are repaired if a service call is
made. Therefore, all repair rates transition to state 0.
Figure 14-26. 2oo3 Markov Model
where 1 indicates one minus the sum of all other row elements.
Reliability and safety metrics can be calculated from this P matrix using
numerical techniques.
EXAMPLE 14-28
The PFS and PFD are calculated by multiplying the P matrix by a row
matrix S starting with the system in state 0. When this is done, the
PFD at 8760 hours = 0.00000822 and the PFS at 8760 hours =
0.00000665.
2oo2D Architecture
The 2oo2D is an architecture that consists of two 1oo1D controllers
arranged in a 2oo2 style (Figure 14-28). Since the 1oo1D protects against
dangerous failures when the diagnostics detect the failure, two controllers
can be wired in parallel to protect against a false trip. Since the 2oo2
arrangement provides the best fault tolerance against false trips, a
well-implemented 2oo2D offers highly effective protection against false
trips while still providing safety. Effective
diagnostics are important to this architecture as an undetected dangerous
failure in either controller will fail the system dangerous.
EXAMPLE 14-29
EXAMPLE 14-30
The 2oo2D architecture shows good tolerance to both safe and dangerous
EXAMPLE 14-31
Problem: Using failure rate values from Example 14-29, calculate the
PFS and PFD of a 2oo2D system.
2oo2D Safety
P 0 1 2 3 4 5
0 0.99999512 0.00000464 0.00000016 0.00000005 0.00000000 0.00000004
1 0.01388889 0.98610864 0.00000000 0.00000245 0.00000002 0.00000000
2 0.00000000 0.00000000 0.99999753 0.00000245 0.00000000 0.00000002
3 0.01041667 0.00000000 0.00000000 0.98958333 0.00000000 0.00000000
4 0.01388889 0.00000000 0.00000000 0.00000000 0.98611111 0.00000000
5 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 1.00000000
Multiplying the P matrix times a starting matrix S gives the results that
the PFD at 8760 hours is 0.00031194 and the PFS at 8760 hours is
0.00000510.
1oo2D Architecture
The 1oo2D architecture is similar to the 2oo2D but it has extra control lines
to provide 1oo2 safety functionality. These control lines signal diagnostic
information between the two controller units. The controllers use this
information to control the diagnostic switches. Figure 14-32 shows the
1oo2D architecture. Comparing this to Figure 14-28, note the added con-
trol lines. 1oo2D is designed to tolerate both safe and dangerous failures.
The primary difference between 2oo2D and 1oo2D can be seen when a
dangerous undetected failure occurs in one controller. This is shown in
Figure 14-33. The upper unit has a failure that causes the output switch to
fail short circuit. The failure is not detected by the self-diagnostics in that
unit so the diagnostic switch is not opened by its control electronics. How-
ever, when the failure is detected by the other unit, the diagnostic switch
is opened via the additional control line.
EXAMPLE 14-32
EXAMPLE 14-33
The Markov model for a 1oo2D system is shown in Figure 14-36. Four sys-
tem success states are shown: 0, 1, 2 and 3. State 1 represents a safe
detected failure or a dangerous detected failure. As with 2oo2D, the result
of any detected failure is the same since the diagnostic switch de-energizes
the output whenever a failure is detected.
Another system success state, state 2, represents the situation in which one
controller has failed dangerous undetected. The system will operate
correctly in the event of a process demand because the other unit will
detect the failure and de-energize the load via its 1oo2 control lines (Figure
14-33).
The third system success state is shown in Figure 14-37. One unit has
failed with its output de-energized. The system load is maintained by the
other unit which will still respond properly to a process demand.
In state 1 the system has degraded to 1oo1D operation. A second safe fail-
ure or a dangerous detected failure will fail the system safe. As with
1oo1D, a dangerous undetected failure will fail the system dangerous.
Here, however, the system fails to state 5 where one unit has failed
detected and one unit has failed undetected. Note that an assumption has
been made that all units will be inspected and tested during a service call
so an on-line repair rate exits from state 5 to state 0.
In state 2 one unit has failed with a dangerous undetected failure. The sys-
tem is still successful since it will respond to a demand as described above.
From this state, any other component failure will fail the system danger-
ous. In Figure 14-38, a dangerous undetected failure occurred first in the
top unit and a safe detected failure then occurred in the lower unit. Since it
is assumed that any component failure in a unit causes the entire unit to
fail, it must be assumed that the control line from the lower unit to the
upper unit will not work. Under those circumstances, the system will not
respond to a process demand; it has failed dangerously.
Figure 14-38. 1oo2D with Two Failures, State 5
In state 3 one unit has failed safe undetected. In this condition the system
has also degraded to 1oo1D operation. Additional safe failures (detected
or undetected) or dangerous detected failures will cause the system to fail
safe. An additional dangerous undetected failure will fail the system dan-
gerous taking the Markov model to state 6 where both units have an unde-
tected failure. Failures from this state are not detected until there is a
manual proof test.
The transition matrix for the Markov model of Figure 14-36 is:
EXAMPLE 14-34
Problem: Using failure rate values for a single board safety rated
controller from Example 14-29, calculate the PFS and PFD of a
1oo2D system.
1oo2D Safety
P 0 1 2 3 4 5 6
0 0.999995115 0.000004637 0.000000035 0.000000163 0.000000049 0.000000000 0.000000000
1 0.013888889 0.986108644 0.000000000 0.000000000 0.000002449 0.000000018 0.000000000
2 0.000000000 0.000000000 0.999997533 0.000000000 0.000000000 0.000002366 0.000000101
3 0.000000000 0.000000000 0.000000000 0.999997533 0.000002449 0.000000000 0.000000018
4 0.010416667 0.000000000 0.000000000 0.000000000 0.989583333 0.000000000 0.000000000
5 0.013888889 0.000000000 0.000000000 0.000000000 0.000000000 0.986111111 0.000000000
6 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 1.000000000
In a variation of 1oo2D, if the automatic self-diagnostics do not detect the
fault, the comparison diagnostics in both units will de-energize the
diagnostic switches when a mismatch is detected in order to ensure safety.
The notation C2
will be used to indicate this additional diagnostic coverage factor. This is
one of several variations of the 1oo2D architecture. Additional variations
and modeling are shown in Reference 3.
EXAMPLE 14-35
Problem: Two single safety controllers with diagnostic channels are
used in a 1oo2D architecture system. Using the failure rates for a
safety rated single board computer with a diagnostic channel from
Example 14-25 and a coverage factor for comparison of 99.9%, what
is the approximate PFD?
Solution: Using Equation 14-17, PFD1oo2D = 0.0000000004 × 8760
+ [(1 − 0.999) × 0.0000000176 × 8760] = 0.00000315.
The system fails safe if either unit fails safe undetected, if both units fail
safe, or if there is an undetected failure that causes a comparison
mismatch. Note that the fault tree shows an incomplete symbol, which
indicates that the failures not detected by self-diagnostics or by the
comparison diagnostics are not included in the tree since they are orders
of magnitude smaller and therefore considered insignificant. The first
order approximation techniques can be used to generate a formula for the
probability of the system failing safe from this fault tree:

PFS1oo2D = (λSDC + λSUC + λDDC + 2C2·λSUN + 2C2·λDUN) × SD
           + (λSDN × RT + λDDN × RT)²    (14-18)
EXAMPLE 14-36
detected by the comparison. Previously undetected safe and dangerous
failures detected by the comparison diagnostics are added to the arc from
state 0 to state 4. Only failures undetected by the comparison diagnostics
will cause a transition to states 2 and 3.
The transition matrix for the Markov model of Figure 14-42 is:
EXAMPLE 14-37
Problem: Using failure rate values from Example 14-29, calculate the
PFS and PFD of a 1oo2D system with comparison diagnostics.
Comparing Architectures
When the results of the example calculations are examined, several of the
main features of different architectures become apparent. The results for
the four classic architectures (1oo1, 1oo2, 2oo2 and 2oo3) implemented
with a conventional micro PLC are compiled in Table 14-2. Of the four
architectures, the highest safety rating goes to the 1oo2 architecture, with
the lowest PFD of 0.00013. The 2oo3 architecture also does well, with a
PFD of 0.00031. Note that PFD for the 2oo3 is roughly three times worse
than 1oo2. This is to be expected as there are three sets of parallel switches
in 2oo3. It should also be noted that all PFD and PFS results will certainly
be different for different failure rates and other parameter values.
Of the classic architectures, 2oo2 has the lowest PFS as expected. Even so,
the 2oo2 PFS should theoretically be much lower. However, a close exami-
nation of the PFS equation shows that a unit with a safe undetected failure
would remain failed for the entire mission time. When this happens a sec-
ond safe failure in the other unit causes system failure. This situation dom-
inates the PFS result and clearly shows that automatic diagnostics are as
important as the architecture.
Table 14-3 compares the results for the four classic architectures using the
single safety controller models. It is interesting to note that the fault tree
results and the Markov results are similar for this set of parameters. The
failure rates used in the examples are sufficiently small to allow the first
order approximation to be reasonably accurate.
A comparison of all results for the safety PLC is shown in Table 14-4. It is
interesting to note that the fault tree results (denoted ft) and the
Markov results (denoted mm) are practically identical, with
the Markov providing slightly lower numbers. This is to be expected as
the Markov numerical solution technique provides more precision than
the approximation techniques. The author favors the Markov approach,
not because of these differences in the results, but because the Markov
model development is more systematic. At least for the author, with
techniques other than Markov models it is too easy to neglect
combinations of two or more failures that might have an impact for some
ranges of parameters.
The architectures 2oo3, 1oo2D, and 1oo2D with comparison diagnostics all
provide excellent safety. 2oo2, 2oo2D, 2oo3, 1oo2D, and 1oo2D with com-
parison diagnostics provide excellent operation without an excessive false
trip rate (low PFS).
Table 14-4. Single Safety Controller Model Results for All Architectures
(Architecture Comparison - 61508 Certified Safety PLC)

Architecture   PFDft        PFSft        PFDavg(ft)   PFDmm        PFSmm        PFDavg(mm)
1oo1           0.00021773   0.00015216   0.00013889   0.00021766   0.00015210   0.00013833
1oo2           0.00000440   0.00030128   -            0.00000440   0.00030112   0.00000279
2oo2           0.00043110   0.00000336   -            0.00043076   0.00000321   0.00027380
1oo1D          0.00015896   0.00023510   -            0.00015827   0.00023500   0.00007903
2oo3           0.00000822   0.00000826   -            0.00000822   0.00000665   0.00000521
2oo2D          0.00031467   0.00000548   -            0.00031194   0.00000510   0.00015600
1oo2D          0.00000318   0.00000524   -            0.00000345   0.00000510   0.00000166
1oo2D Comp.    0.00000315   0.00002364   -            0.00000315   0.00002368   0.00000158
The 2oo2D architecture can provide the best compromise if the automatic
diagnostics are excellent. This level of automatic diagnostics is being
achieved in new designs, especially those with microprocessors specifi-
cally designed for use in IEC 61508 certified systems. This new approach is
being pursued because failure probability comparison results like Table
14-4 show the results to be superior to the traditional approach of using
2oo3. When new architectures are used to provide better overall designs,
the value of probabilistic analysis as a design tool is clear.
Exercises
14.1 A safety instrumented function (SIF) uses three analog input chan-
nels, two digital input channels, and two digital output channels
on a conventional micro PLC controller. The following dangerous
failure rates are given:
Analog Input Channel dangerous failure rate = 7 FITS
Digital Input Channel dangerous failure rate = 24 FITS
Digital Output Channel dangerous failure rate = 34 FITS
Common and Main Processing Circuits dangerous
failure rate = 355 FITS
What is the total dangerous failure rate for the portion
of the controller used in the SIF?
14.2 We are given the following coverage factors for the single board
controller of Exercise 14.1:
Analog Input Channel = 0.97
Digital Input Channel = 0.99
Digital Output Channel = 0.95
Main Processing Circuits = 0.99
What are the DD and DU failure rates of all failure categories when
the board is used in a 1oo1 system configuration?
14.3 Two redundant digital input channels are implemented on a com-
mon circuit board and a beta factor of 3% is estimated. If the DU
failure rate of the digital input circuit is 0.24 FITS, what are the DUC
and DUN failure rates?
Answers to Exercises
14.1 The total dangerous failure rate is 3 × 7 + 2 × 24 + 355 + 2 × 34 = 492
FITS.
14.2 AI λDD = 0.97 × 7 = 6.8 FITS
AI λDU = (1 − 0.97) × 7 = 0.2 FITS
DI λDD = 0.99 × 24 = 23.8 FITS
DI λDU = (1 − 0.99) × 24 = 0.2 FITS
Com λDD = 0.99 × 355 = 351.4 FITS
Com λDU = (1 − 0.99) × 355 = 3.6 FITS
DO λDD = 0.95 × 34 = 32.3 FITS
DO λDU = (1 − 0.95) × 34 = 1.7 FITS
14.3 DI λDUC = 0.03 × 0.24 = 0.0072 FITS
DI λDUN = (1 − 0.03) × 0.24 = 0.2328 FITS
14.4 1oo2, 2oo3, 1oo2D, 1oo2D/Comparison. 2oo2D can achieve high
safety (low PFD) if it has effective automatic diagnostics.
14.5 2oo2, 2oo2D, 2oo3, 1oo2D, 1oo2D Comparison
14.6 Yes, previously undetected failures will be detected and repaired.
This results in higher safety and higher availability in redundant
systems.
14.7 AU modeling must be done when automatic diagnostic coverage
exceeds 98%. AU modeling should be considered when automatic
diagnostic coverage exceeds 95%.
14.8 Answer d. The D in the architecture name 1oo1D means that the
architecture has an independent switch to de-energize the output
when a failure is detected by the automatic diagnostics.
References
1. Goble, W. M. Evaluating Control System Reliability: Techniques and
Applications, First Edition. Research Triangle Park: ISA, 1992.
Risk Cost
Risk is usually defined as the probability of a failure event multiplied by
the consequences of the failure event. The consequences of a failure event
are measured in terms of risk cost. The concept of risk cost is a statistical
concept. An actual cost is not incurred each year. Actual cost is incurred
only when there is a failure event (an accident). The individual event cost
can be quite high. If event costs are averaged over many sites for many
years, an average risk cost per year can be established. If actions are taken
to reduce the chance of a failure event or the consequences of a failure
event, risk costs are lowered.
EXAMPLE 15-1
Risk Reduction
There are risks in every activity of life. Admittedly some activities involve
more risk than others. According to Reference 1, the chance of dying dur-
ing a 100 mile automobile trip in the midwestern United States is 1 in
588,000. The average chance each year of dying from an earthquake or vol-
cano is 1 in 11,000,000.
While inherent risk and even acceptable risk are very hard to quantify,
risk reduction is a little easier. Several methods have been proposed to
determine the amount of risk reduction needed, to at least an
order-of-magnitude level (Ref. 2).
demand. In the case of a de-energize-to-trip system, this is the probability
that the system will fail with its outputs energized. For low demand
systems, a dangerous condition occurs infrequently; therefore PFDavg
(Chapter 4) is the relevant measure of probability of dangerous failure. In
such cases the risk reduction factor (RRF) achieved is defined as:

RRF = 1 / PFDavg
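As a small illustration of this definition, the sketch below converts a PFDavg into an RRF and maps it onto the low-demand SIL bands of IEC 61508 (RRF 10-100 for SIL1 up through 100,000 for SIL4, consistent with the exercise answers at the end of this chapter).

def risk_reduction_factor(pfd_avg: float) -> float:
    # RRF = 1 / PFDavg for a low-demand safety instrumented function
    return 1.0 / pfd_avg

def sil_band(rrf: float) -> str:
    # Low-demand SIL bands per IEC 61508
    if 10 <= rrf < 100:
        return "SIL1"
    if 100 <= rrf < 1_000:
        return "SIL2"
    if 1_000 <= rrf < 10_000:
        return "SIL3"
    if 10_000 <= rrf < 100_000:
        return "SIL4"
    return "outside the SIL1-SIL4 bands"

rrf = risk_reduction_factor(0.1)
print(rrf, sil_band(rrf))   # 10.0 SIL1 (compare Exercise 15.2)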
[Table: safety integrity level vs. average probability of failure on demand (PFDavg, low demand), risk reduction factor (RRF), and typical applications such as chemical processes.]
[Risk graph: consequence (C1-C4), exposure frequency (F1, F2), possibility of avoidance (P1, P2), and demand rate (W1-W3) branches map to outcomes ranging from NSS/NS through SIL1-SIL4 and NPES.]
NS - No Safety Requirements
NSS - No Special Safety Requirements
NPES - Single SIS Insufficient
EXAMPLE 15-2
Figure 15-4. Screen Shot: Risk Graph for Personnel, Equipment, and Environmental (Ref. 5)
EXAMPLE 15-3
SIS Architectures
An SIS consists of three categories of subsystems: sensors/transmitters,
controllers, and final elements, which work together to detect and prevent,
or mitigate the effects of, a hazardous event. It is important to design a
system that meets RRF requirements.
production uptime (minimize false trips). In order to achieve these goals,
system designers often use redundant equipment in various architectures
(Chapter 14). These fault tolerant architectural configurations apply to
field instruments (sensors/transmitters and final elements) as well as to
controllers.
Sensor Architectures
An SIS must include devices capable of sensing potentially dangerous
conditions. There are many types of sensors used including flame detec-
tors (infrared or ultraviolet), gas detectors, pressure sensors, thermocou-
ples, RTDs (resistance temperature detectors), and many types of discrete
switches. These sensors can fail, typically in more than one failure mode.
Some sensor failures can be detected by on-line diagnostics in the sensor
itself or in the controller to which it is connected. In general, all of the reli-
ability and safety modeling techniques can be used.
EXAMPLE 15-4
The pressure transmitter is configured to send its output to 20.8 mA
(over-range) if a failure is detected by the internal diagnostics. The
transmitter is fully tested and calibrated every five years. In this
application, what is the PFDavg for a single transmitter in a 1oo1
architecture for a five year mission time? What is the Spurious Trip
Rate (STR)?
Solution: The trip amplifier will falsely trip if the transmitter fails with
its output signal high. The system will fail dangerous if the transmitter
fails with its output signal low or if it has a dangerous undetected
failure. The failure rates are therefore:
The term Spurious Trip Rate refers to the average rate at which a
subsystem will cause a shutdown when no dangerous condition
occurs. The STR is equal to the safe failure rate of 3.23 × 10⁻⁷ trips
per hour.
EXAMPLE 15-5
Solution: The logic solver will detect out-of-range current signals and
hold the last pressure value. Therefore, these failures are dangerous
detected (DD). The total DD failure rate is 356 FITS (one FIT = 1 × 10⁻⁹
failures per hour). Using Equation 14-2,
Spurious Trip Rate: Because the logic solver is configured to hold the
last pressure reading on failure of the sensor, no false trips will occur.
The STR is zero.
Figure 15-6 shows two discrete sensors measuring the same process vari-
able. These two sensors can be configured in a 1oo2 architecture by simply
adding logic to initiate a shutdown if either of the two sensors signals a
dangerous condition. Like the 1oo2 controller architecture (Chapter 14),
this configuration will substantially reduce the chance of a dangerous fail-
ure but will almost double the chance of a safe failure. Note that common
cause failures apply when redundant configurations are used.
EXAMPLE 15-6
Problem: What is the PFDavg for a 1oo2 sensor subsystem for a five
year mission time? What is the Spurious Trip Rate for the sensor
subsystem?
Solution: PFDavg can be calculated using Equation 14-5. Note that
there are no detected failures; therefore RT (average repair time) = 0.
PFDavg is 0.00014.
[Figure 15-6. Two discrete pressure sensors wired to separate logic solver discrete inputs; 1oo2 logic trips if either sensor indicates a trip is needed.]
The 1oo2 sensor concept can be applied to analog sensors as well. Figure
15-7 shows two analog sensors measuring the same process variable. A
high select or low select function block (depending on the fail-safe
direction) is used to select which analog signal will be used in the
calculation.
[Figure 15-7. Two 4-20 mA analog pressure transmitters wired to logic solver analog inputs; a high or low select (depending on the trip function) implements the 1oo2 logic.]
EXAMPLE 15-7
Problem: What are the PFDavg and Spurious Trip Rate for a 1oo2
analog sensor subsystem?
Solution: The logic solver will detect out-of-range signals and hold
the last pressure value. Therefore, these out-of-range failure rates
are classified as dangerous detected (DD). The total DD failure rate
for each sensor is 356 FITS. Considering common cause, the failure
rates are:
PFDavg = 0.000088
STR = 0.
[Figure: final element assembly consisting of an air supply, solenoid valve, actuator, and valve.]
EXAMPLE 15-8
[Figure: logic solver discrete output driving two valve assemblies.]
EXAMPLE 15-9
Problem: Using the failure rates of Example 15-8, what are the STR
and PFDavg of a 1oo2 final element assembly? Assume a common-
cause beta factor of 10%. No diagnostics are available. The final
element assembly is removed from service and tested/rebuilt every
five years.
λSUC = 24 FITS
λSUN = 219 FITS
λDUC = 105 FITS
λDUN = 942 FITS
EXAMPLE 15-10
Solution: Assuming a constant failure rate for the impulse line, the
failure rate is calculated using Equation 4-18: λ = 1/(20 × 8760) =
0.0000057 failures per hour. Equation 14-2 can be used to
approximate the PFDavg. Without the diagnostic algorithm the entire
failure rate must be classified as dangerous undetected.
Exercises
15.1 A process is manned continuously and has no risk reduction mech-
anism. An accident could cause death to multiple persons. Danger-
ous conditions do build slowly and alarm mechanisms should
warn of dangerous conditions before an accident. Using a risk
graph, determine how much risk reduction is needed.
15.2 A quantitative risk assessment indicates an inherent risk cost of
$250,000 per year for an industrial process. Plant management
would like to reduce the risk cost to less than $25,000 per year.
What risk reduction factor is required? What SIL classification is
this?
15.3 What components must be considered in the analysis of an SIF?
15.4 A process connection clogs every year on average. This is a danger-
ous condition. No diagnostics can detect this failure. Assuming a
constant failure rate, what is the dangerous undetected failure
rate?
15.5 Using the failure rate of Exercise 15.4, what is the approximate
PFDavg for a three-month inspection interval in a 1oo1
architecture?
Answers to Exercises
15.1 Death to multiple persons is classified as a C3 consequence. The
frequency of exposure is continuous, F2. Alarms give a possibility
of avoidance, P1. Since no protection equipment is installed the
probability of unwanted occurrence is high, W3. The risk graph
indicates SIL3. The risk reduction factor needed for an SIF would
be in the range of 1,000 to 10,000.
15.2 The necessary risk reduction factor is 250,000/25,000 = 10. This is
classified as SIL1.
15.3 SIF reliability and safety analysis should consider all components
from sensor process connection to valve process connection. Typi-
cally this includes impulse lines, manifolds, sensors, controllers,
power supplies, solenoids, air supplies, valve actuators, and valve
elements. If communications lines are required for safe shutdown
by an SIS then they must be included in the analysis.
15.4 All failures are dangerous undetected. The dangerous undetected
failure rate is 1/8760 = 0.000114155 failures per hour.
15.5 Using Equation 14-2, the approximate PFDavg is calculated as
0.125. At this high failure rate the approximation method is
expected to have some error. Use of the full equation or a Markov
based tool would eliminate the error.
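The size of that error is easy to check numerically. The sketch below compares the first-order approximation λ·TI/2 used in this answer with the exact time-averaged unavailability for a constant failure rate, PFDavg = 1 − (1 − e^(−λ·TI))/(λ·TI); this closed form assumes the simple single-component model, so it is an illustration rather than the book's full equation.

import math

lam = 1.0 / 8760.0   # dangerous undetected failure rate from Exercise 15.4, per hour
ti = 2190.0          # three-month inspection interval, hours

approx = lam * ti / 2.0
exact = 1.0 - (1.0 - math.exp(-lam * ti)) / (lam * ti)

print(f"first-order PFDavg = {approx:.4f}")   # 0.1250
print(f"exact PFDavg       = {exact:.4f}")    # 0.1152, about 8% lower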
References
1. A Fistful of Risks. Discover Magazine. NY: Walt Disney Magazine
Publishing Co., May 1996.
7. Failure Modes Effects and Diagnostic Report, exida, Delta Controls S21
Pressure Switch, DRE 06/06-33 R001. Surrey, West Molesey: Delta
Controls Ltd., Oct. 2006.
9. Safety Equipment Reliability Handbook, Volume 3, Third Edition,
Page 139, Bettis CB/CBA Series Spring Return Actuator. Sellers-
ville: exida, 2007.
12. Wehrs, D., Detection of Plugged Impulse Lines Using Statistical Process
Monitoring Technology. Chanhassen: Rosemount, 2006.
The cost factors that need to be included in a lifecycle cost analysis vary
from system to system. Specific cost factors may not be relevant to a spe-
cific system. The object is to include all costs of procurement and opera-
tion over the lifecycle of a system. The various costs occur at different
times during system life. This is not a problem, since commonly-used life-
cycle costing techniques account for interest rates and the time value of
money.
The two primary categories of costs are procurement cost and operating
cost. These costs are added to obtain total lifecycle cost:

Total Lifecycle Cost = Procurement Cost + Operating Cost
Procurement Costs
Procurement costs include system design cost, purchase cost of the equip-
ment including initial training, installation cost, and start-up/system com-
missioning cost. These costs occur only once. The total is obtained by
summing:

Procurement Cost = Design Cost + Purchase Cost + Installation Cost + Start-up Cost
The next step, detailed design work, usually represents the biggest cost.
The effort required depends to a great extent on the available system tools
and experience of the engineering staff. System tools that include a control
point database, good report-generation facilities, a graphics language pro-
gramming interface, and other computer-aided design assistance can
really cut costs. Drawings and other documentation costs are also affected
by the choice of tools; computer-aided graphics tools save time and reduce
system level design errors.
Good training can reduce engineering costs. This is especially true when
the engineers have no experience with the control system. A good training
program can jump start the design effort.
Purchase Costs
Purchase costs always include the cost of equipment. Sometimes
neglected are the additional costs of cabinetry, environmental protection
(air conditioning, etc.), wire termination panels, factory acceptance tests (if
required), initial training, initial service contracts, and shipping. Purchase
costs normally represent the focal point of any system comparison. This is
quite understandable, since capital expenditures often require a tedious
approval procedure and many projects compete for a limited capital bud-
get. These costs are also easy to obtain and have a high level of certainty.
However, it is a mistake to consider only purchase cost.
Installation Costs
Installation costs must account for delivery of the equipment to the final
location, placing and mounting, any weatherization required, piping, and
wiring. These costs are affected by the design of the equipment. Small
modular cabinets are easier to move and mount compared to large cabinet
assemblies. Wiring is simplified when wire termination panels are easy to
access. Some systems offer field mountable termination panels to reduce
wiring costs. Remote I/O and distributed I/O are also concepts that
are designed to reduce wiring costs by placing the I/O near the
transmitter.
Start-Up Costs
Start-up costs must include the inevitable system test, debug process, and
safety functional verification. Configuration tools that help manage sys-
tem change, and automatic change documentation, can cut costs in this
area. Testing and full functional verification are a big portion of start-up
costs. Testing is expedited when the control system allows detailed
monitoring and forcing of I/O points. On-line display of system variables, ideally
within the context of the system design drawings, can also speed the test-
ing and debugging of complex systems.
Operating Costs
Operating costs can overshadow procurement costs in many systems.
Depending on the consequences of lost production, operating costs may
dominate any lifecycle cost study. One experienced system designer
stated, "Our plant manager simply cannot tolerate a shutdown due to
control system failure!" There are reasons for this attitude; the cost of a
shutdown can be extremely high.
The cost of a shutdown is not the only operating cost. Other operating
costs include the cost of system engineering changes (both software and
hardware).
In the case of Safety Instrumented Systems (SIS), the risk cost (Chapter 15)
is increased when the system probability of failure on demand goes up.
While safety ratings in terms of the required safety integrity level are usu-
ally dictated by a safety study, reduced risk cost may justify even higher
levels of safety integrity.
Engineering Changes
Engineering changes are part of operating costs. All systems are inevitably
changed. As a system is used, it becomes obvious that changes will
improve system operation. Everything cannot be anticipated by the origi-
nal designers. System-level design faults must be repaired. The strength of
the system design can be increased.
The cost of making a change can vary widely. Upper bound and lower
bound estimates should be made. Factors to be considered when making
the estimate include the procedures required to change the system, the
ease of changing the documentation, and the ease of testing and verifying
the changes. In the case of Safety Instrumented Systems a review of initial
hazard analysis and tasks such as updates to the periodic maintenance test
procedures must be considered. System capabilities affect these estimates.
Systems that have automatic updating of system design documentation
can be changed with much less effort.
Consumption Costs
System operation requires energy consumption and parts consumption.
These costs are typically estimated on an annual basis. Energy consump-
tion includes both the energy required to operate the control equipment
and the energy required to maintain an equipment environment when the
control system does not have sufficient internal strength to withstand spe-
cific environmental stresses.
The timing of the failure may affect cost. If the braking system of your
automobile fails just as another vehicle pulls in front of you, the costs are
likely to be high. Brake failure when coming to a stop on a long straight
empty road is likely to cost much less. This illustrates the concept of fail-
ure on demand. The concept is particularly important to an SIS. A cata-
strophic event can be extremely expensive.
All of these factors complicate the task of estimating the expected cost of
system failure. Though the determination is complicated, general risk
analysis techniques apply. Reliability analysis techniques provide the
probability of failure information. Reliability parameters can be used to
calculate the expected costs. Costs of failure are multiplied by the proba-
bility of failure.
Repair labor and lost production costs are associated with system
downtime and should be estimated in consistent units (typically on a
per-hour basis). For a fully repairable, single-failure-state system, the
yearly failure cost is obtained by multiplying the hourly cost of downtime
by the system unavailability and by the yearly operating hours (Equations
16-3 and 16-4).
The number of operating hours used in Equation 16-3 and Equation 16-4 is
typically the number of hours in one year (8,760). In systems that do not
operate continuously, actual operational hours are used.
CFailure = CE × UE    (16-5)
COP = [(CEC + CFM + CCC + CFailure) × years of life]    (16-6)
A lifecycle cost analysis requires a reliability analysis, costs estimates, and
a lifecycle analysis. The hardest part is sometimes getting started. What
factors must be accounted for in your system? Bringing together the points
that have been discussed earlier in the chapter, Table 16-1 shows a check-
list of lifecycle cost items that should be considered.
EXAMPLE 16-1
Design Cost
52 Hours @ $75/Hour Engineering Time
22 Hours @ $45/Hour Drawing/Documentation Time
16 Hours @ $75/Hour Safety Review
One repair technician for 10 machines.
Consumption Costs
Electricity $1,200/Year
Lubricating Oil $200/Year
Filters $100/Year
Failure Costs
Repair Labor Rate $100/Hour
Lost Production Cost $2,000/Hour
52 × $75 = $3,900
22 × $45 = $990
16 × $75 = $1,200
Purchase $120,000
Installation Cost
Truck Rental $300
32 Hours @ $75/Hour
Start-Up Cost
Training Course Fee $1,500
80 Hours @ $75/Hour Training Time
10 Hours @ $75/Hour Equipment Assembly
The reliability analysis has determined that the forging machine has a
steady-state availability of 98.04% and a steady-state unavailability of
1.96%. Using unavailability, the yearly failure cost is calculated by
using Equation 16-3.
$367,561.60 × 10 = $3,675,616
Please note that failure costs for this example are much larger than
other costs.
EXAMPLE 16-2
EXAMPLE 16-3
There are two failure modes in the temperature control system. If the
control system fails with its outputs energized, the excess heat will
destroy the entire extruding machine. Including the cost of lost
production, this event would cost $1,200,000. If the temperature
control system fails with its outputs de-energized, the heat is removed
and the material in the extruder will harden. The extruding machine
must be rebuilt if this occurs. The cost of a machine rebuild, including
the cost of lost production, is $200,000 per event.
Discount Rate
Almost everyone is familiar with the concept of compound interest. An
amount of money (or principal) is invested at a particular interest rate. The
interest earned is reinvested so that it too earns interest. Inflation is
another familiar concept. When considering the time value of money, both
interest rates and inflation rates must be taken into account. A term that
combines both is called the discount rate. While it might seem logical to
simply take the interest rate and subtract the inflation rate, the concept of
discount rate also includes some judgment regarding the uncertainty of
numbers estimated for the future. Financial experts also consider financial
risk. Risky projects are often given a higher discount rate. The discount
rate R is expressed as a percentage. The best source for the discount rate
estimate is the corporate financial officer.
At the end of the first year, the invested principal M has grown to

M + MR = M(1 + R)

This amount is considered the principal for the second year. At the begin-
ning of the third year, the compounded amount of money is:

M(1 + R)(1 + R) = M(1 + R)²

This formula can be generalized for any number of years because the pat-
tern continues. For any quantity of years, N, the future value of money,
FV, after N years equals:

FV = M(1 + R)^N    (16-9)
EXAMPLE 16-4
Present Value
Another way to account for the time value of money is to calculate the
present value of some future expense. This is the equivalent of asking how
much money to invest now in order to pay for some future expense. The
equation for present value is obtained from Equation 16-9. Look at Exam-
ple 16-4. The compound amount $447,700 is the future value. The princi-
pal of $250,000 is the present value of $447,700. To directly calculate a
present value when a future value is known, starting with Equation 16-9
and substituting M for FV and PV for M, the equation for PV is:
PV = M / (1 + R)^N    (16-10)

where PV is the present value, M is the future amount, R is the discount
rate, and N is the number of years.
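The two equations are simple enough to check with a few lines of code. In the sketch below, the 6% rate and ten-year term are assumptions chosen to match the quoted figures ($250,000 growing to roughly $447,700).

def future_value(m: float, r: float, n: int) -> float:
    return m * (1.0 + r) ** n        # Equation 16-9

def present_value(m: float, r: float, n: int) -> float:
    return m / (1.0 + r) ** n        # Equation 16-10

print(f"{future_value(250_000, 0.06, 10):,.0f}")    # ~447,712
print(f"{present_value(447_700, 0.06, 10):,.0f}")   # ~249,993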
EXAMPLE 16-5
EXAMPLE 16-6
Total yearly costs are the sum of the yearly fixed costs and the yearly
failure costs. Total yearly costs are $367,561.60.
For each cost, calculate the cost with discount. This is done using
Equation 16-9. Initial costs must include finance charges for the
entire 10-year period. The calculation is as follows:
The calculation is repeated for each year and the results are added.
A personal computer spreadsheet program is highly recommended
for this type of problem. Such programs are quick to set up, reduce
mistakes, and provide great flexibility. Discount rates and costs can
be varied each year. A listing obtained from a spreadsheet shows
yearly costs for the problem.
EXAMPLE 16-7
Year 6 Present Value 274,280.13
Year 7 Present Value 261,219.17
Year 8 Present Value 248,780.16
Year 9 Present Value 236,933.48
Year 10 Present Value 225,650.94
Initial costs 137,000.00
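The spreadsheet-style listing above can be reproduced with a short script. The sketch below discounts the total yearly cost of $367,561.60 from Example 16-6 over ten years; the 5% discount rate is an assumption that matches the listed present values to within fractions of a dollar.

yearly_cost = 367_561.60   # total yearly cost from Example 16-6
rate = 0.05                # assumed discount rate
initial_costs = 137_000.00

total = initial_costs
for year in range(1, 11):
    pv = yearly_cost / (1.0 + rate) ** year
    total += pv
    print(f"Year {year:2d} present value {pv:12,.2f}")
print(f"Total lifecycle cost {total:14,.2f}")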
Annuities
An annuity is a sequence of payments made at fixed periods of time over a
given time interval (it is assumed that payments are made at the end of
each period). Yearly lifecycle costs can be modeled as an annuity. Both
future value costs and present value costs can be calculated.
The future value of an annuity can be obtained by applying Equation 16-9
to each year's payment:

FV = M + M(1 + R)^1 + M(1 + R)^2 + … + M(1 + R)^(N−1)

Multiplying both sides by (1 + R) and subtracting the original series
leaves only the first and last terms:

FV(1 − (1 + R)) = M − M(1 + R)^N

Therefore,

FV = M[(1 + R)^N − 1] / R
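A quick numeric check of the closed form against the direct year-by-year summation, with an assumed $1,000 yearly payment at 5% over ten years:

M, R, N = 1_000.0, 0.05, 10

direct = sum(M * (1.0 + R) ** k for k in range(N))   # M + M(1+R) + ... + M(1+R)^(N-1)
closed = M * ((1.0 + R) ** N - 1.0) / R

print(f"{direct:,.2f}  {closed:,.2f}")               # both 12,577.89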
EXAMPLE 16-8
EXAMPLE 16-9
Risk costs are calculated by multiplying the cost of an event times the
probability of an event. If an SIS is used, the probability of an event equals
the probability of an event without the SIS times the PFD of the SIS.
PEVENT with SIS = PEVENT without SIS × PFDSIS    (16-13)
Risk costs are included in yearly costs and may be treated as an annuity.
EXAMPLE 16-10
Problem: With no risk reduction, the probability of an event is 0.01. If
an SIS is added with a PFD of 0.001, what is the probability of an
event?
Solution: Using Equation 16-13, PEVENT with SIS = 0.01 × 0.001 =
0.00001.
Other cost factors that must be considered when adding an SIS to a pro-
cess include the cost of a false trip. An SIS may fail safe, falsely tripping
the system. The decreased risk cost must be greater than the increased
production downtime cost. Other lifecycle cost factors are much the same
for an SIS as for other process control systems.
EXAMPLE 16-11
Solution: The risk cost without the SIS equals 0.01 × $2,000,000 =
$20,000/year. The risk cost with the SIS equals 0.01 × 0.001 ×
$2,000,000 = $20/year. With the SIS added to the system,
incremental trip costs equal 0.01 × $1,000 = $10/year. The cost
comparisons are:
No SIS SIS
Procurement Costs: $0 $50,000
Yearly Risk/Failure Cost: $20,000 $30
Yearly Operational Cost: $0 $600
Converting yearly expenses to present value and adding the totals for
each year:
No SIS Discount 5%
Total Yearly $20,000 Cumulative $0
PV year 1 $19,048 Year 1 $19,048
PV year 2 $18,141 Year 2 $37,188
PV year 3 $17,277 Year 3 $54,465
PV year 4 $16,454 Year 4 $70,919
PV year 5 $15,671 Year 5 $86,590
Total lifecycle costs for five years $86,590
SIS Discount 5%
Total Yearly $630 Cumulative $50,000
PV year 1 $600 Year 1 $50,600
PV year 2 $571 Year 2 $51,171
PV year 3 $544 Year 3 $51,716
PV year 4 $518 Year 4 $52,234
PV year 5 $494 Year 5 $52,728
Total lifecycle costs for five years $52,728
Exercises
16.1 A control system has a procurement cost of $100,000. Fixed yearly
costs are $5,000. The system has an availability of 99%. The failure
costs are $1,000 per hour. The system will be used for five years.
During this time period, inflation and interest are identical so the
discount rate will be zero (no need to account for the time value of
money). Calculate the lifecycle cost.
16.2 The control system of Exercise 16.1 is fully repairable, with avail-
ability calculated using MTTF equals 9900 hours and MTTR equals
100 hours. An expert system can be purchased at a cost of $10,000.
This expert system will diagnose error messages and identify
failed components. The MTTR will be reduced to 50 hours. What is
the new availability? What is the new lifecycle cost? Should the
expert system be purchased?
16.3 The control system of Exercise 16.1 is available with dual redun-
dant modules. The extra modules will increase fixed yearly costs to
$6,000. The MTTF will increase to 100,000 hours. The MTTR
remains at 50 hours when the expert system is used. The procure-
ment cost increases to $210,000. What is the new availability? What
is the new lifecycle cost? Should the dual redundant modules be
purchased?
16.4 An expenditure of $100,000 must be made. The discount rate is 4%.
What is the future value of this expenditure after five years?
16.5 An expenditure of $150,000 at today's cost must be made in five
years. What is the amount that must be invested now (present
value) in order to purchase the item in five years? Assume that the
discount rate is 5%.
16.6 Repeat Exercise 16.1 assuming that the discount rate is 5%. How
much did the lifecycle cost change?
Answers to Exercises
16.1 Yearly failure costs equal 0.01 × $1,000 per hour × 8,760 hours per
year = $87,600. The totals are:
Availability 0.99
Procurement Costs: $100,000
Yearly Risk/Failure Cost: $87,600
Yearly Operational Cost: $5,000
Discount Rate 0%
Total Yearly $92,600 Cumulative $100,000
1 $92,600 Year 1 $192,600
2 $92,600 Year 2 $285,200
3 $92,600 Year 3 $377,800
4 $92,600 Year 4 $470,400
5 $92,600 Year 5 $563,000
16.6 At a discount rate of 5%, the lifecycle cost dropped from $563,000
to $500,910.
Availability 0.99
Procurement Costs: $100,000
Yearly Risk/Failure Cost: $87,600
Yearly Operational Cost: $5,000
Discount Rate 5%
Total Yearly $92,600 Cumulative $100,000
1 $88,190 Year 1 $188,190
2 $83,991 Year 2 $272,181
3 $79,991 Year 3 $352,173
4 $76,182 Year 4 $428,355
5 $72,555 Year 5 $500,910
Standard normal distribution function Φ(z):
z 0.00 0.01 0.02 0.03 0.04
0.3 0.617912 0.621720 0.625516 0.629301 0.633072
0.4 0.655422 0.659098 0.662758 0.666403 0.670032
0.5 0.691463 0.694975 0.698469 0.701945 0.705402
0.6 0.725748 0.729070 0.732372 0.735654 0.738915
0.7 0.758037 0.761149 0.764239 0.767306 0.770351
0.8 0.788146 0.791031 0.793893 0.796732 0.799547
0.9 0.815941 0.818590 0.821215 0.823816 0.826392
1.0 0.841346 0.843754 0.846137 0.848496 0.850831
1.1 0.864335 0.866502 0.868644 0.870763 0.872858
1.2 0.884931 0.886862 0.888769 0.890653 0.892513
1.3 0.903201 0.904903 0.906584 0.908242 0.909878
1.4 0.919244 0.920731 0.922197 0.923643 0.925067
1.5 0.933194 0.934479 0.935745 0.936993 0.938221
1.6 0.945202 0.946302 0.947385 0.948450 0.949498
1.7 0.955435 0.956368 0.957285 0.958186 0.959071
1.8 0.964070 0.964853 0.965621 0.966376 0.967117
1.9 0.971284 0.971934 0.972572 0.973197 0.973811
2.0 0.977251 0.977785 0.978309 0.978822 0.979325
2.1 0.982136 0.982571 0.982998 0.983415 0.983823
2.2 0.986097 0.986448 0.986791 0.987127 0.987455
2.3 0.989276 0.989556 0.989830 0.990097 0.990359
2.4 0.991803 0.992024 0.992240 0.992451 0.992657
2.5 0.993791 0.993964 0.994133 0.994297 0.994458
2.6 0.995339 0.995473 0.995604 0.995731 0.995855
2.7 0.996533 0.996636 0.996736 0.996834 0.996928
2.8 0.997445 0.997523 0.997599 0.997673 0.997745
2.9 0.998134 0.998193 0.998250 0.998305 0.998359
3.0 0.998650 0.998694 0.998736 0.998777 0.998817
3.1 0.999033 0.999065 0.999096 0.999126 0.999156
z 0.05 0.06 0.07 0.08 0.09
0.0 0.519939 0.523922 0.527903 0.531882 0.535857
0.1 0.559618 0.563560 0.567495 0.571424 0.575346
0.2 0.598707 0.602569 0.606420 0.610262 0.614092
0.3 0.636831 0.640577 0.644309 0.648028 0.651732
0.4 0.673646 0.677243 0.680823 0.684387 0.687934
0.5 0.708841 0.712261 0.715662 0.719044 0.722406
0.6 0.742155 0.745374 0.748572 0.751749 0.754904
0.7 0.773374 0.776374 0.779351 0.782306 0.785237
0.8 0.802339 0.805107 0.807851 0.810571 0.813268
0.9 0.828945 0.831474 0.833978 0.836458 0.838914
1.0 0.853142 0.855429 0.857692 0.859930 0.862145
1.1 0.874929 0.876977 0.879001 0.881001 0.882978
1.2 0.894351 0.896166 0.897959 0.899729 0.901476
1.3 0.911493 0.913086 0.914658 0.916208 0.917737
1.4 0.926472 0.927856 0.929220 0.930564 0.931889
1.5 0.939430 0.940621 0.941793 0.942948 0.944084
1.6 0.950529 0.951544 0.952541 0.953522 0.954487
1.7 0.959942 0.960797 0.961637 0.962463 0.963274
1.8 0.967844 0.968558 0.969259 0.969947 0.970622
1.9 0.974413 0.975003 0.975581 0.976149 0.976705
2.0 0.979818 0.980301 0.980774 0.981238 0.981692
2.1 0.984223 0.984614 0.984997 0.985372 0.985738
2.2 0.987776 0.988090 0.988397 0.988697 0.988990
2.3 0.990614 0.990863 0.991106 0.991344 0.991576
2.4 0.992858 0.993054 0.993245 0.993431 0.993613
2.5 0.994614 0.994767 0.994915 0.995060 0.995202
2.6 0.995976 0.996093 0.996208 0.996319 0.996428
2.7 0.997021 0.997110 0.997197 0.997282 0.997365
2.8 0.997814 0.997882 0.997948 0.998012 0.998074
2.9 0.998411 0.998462 0.998511 0.998559 0.998605
3.0 0.998856 0.998894 0.998930 0.998965 0.998999
3.1 0.999184 0.999211 0.999238 0.999264 0.999289
3.2 0.999423 0.999443 0.999462 0.999481 0.999499
3.3 0.999596 0.999611 0.999624 0.999638 0.999651
3.4 0.999720 0.999730 0.999740 0.999750 0.999759
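The tabulated values are the standard normal distribution function Φ(z) and can be reproduced from the error function, as the short sketch below shows (agreement is within one digit in the last decimal place of the table).

import math

def phi(z: float) -> float:
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(f"{phi(1.00):.6f}")   # 0.841345, table row 1.0, column 0.00
print(f"{phi(1.05):.6f}")   # 0.853141, table row 1.0, column 0.05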
Matrix Math

The Matrix
A matrix is an array of numeric values or algebraic variables. The matrix is
written with rows and columns enclosed in brackets. There is no single
numerical value. An example 3 × 3 matrix (third order) is shown below.
    | a11 a12 a13 |
P = | a21 a22 a23 |
    | a31 a32 a33 |
Column Matrix
If a matrix has only one column, it is known as a column matrix or column
vector. An example is shown below.
    |  3  |
C = |  1  |
    | 0.6 |
    |  2  |
Row Matrix
If a matrix has only one row, it is known as a row matrix or row vector. An
example is shown below.
S = | 1 0 0 0 0 |
Identity Matrix
The identity matrix is a square matrix in which all the elements are zero
except those on a diagonal from upper left to lower right. The diagonal
elements are unity.
1 0 0 0 0
0 1 0 0 0
I = 0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
Matrix Addition
Two matrices of the same size can be added. The elements in
corresponding positions are summed:

| a b c |   | g h i |   | a+g b+h c+i |
| d e f | + | j k l | = | d+j e+k f+l |    (B-1)
Matrix Subtraction
Two matrices of the same size can be subtracted. The elements in
corresponding positions are subtracted.
| a b c |   | g h i |   | a-g b-h c-i |
| d e f | - | j k l | = | d-j e-k f-l |    (B-2)
Matrix Multiplication
Two matrices may be multiplied if the number of columns of the first
matrix equals the numbers of rows of the second matrix. The result will be
a matrix that has a quantity of rows equal to that of the first matrix and a
quantity of columns equal to that of the second matrix.
| a b c |   | g h |   | ag+bi+ck  ah+bj+cl |
| d e f | × | i j | = | dg+ei+fk  dh+ej+fl |    (B-3)
            | k l |
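A small numeric illustration of Equation B-3: a 2 × 3 matrix times a 3 × 2 matrix produces a 2 × 2 result.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2 rows, 3 columns
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])       # 3 rows, 2 columns

print(A @ B)                   # [[ 58  64]
                               #  [139 154]]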
Matrix Inversion
There is no procedure that allows matrices to be divided. However, we
can obtain the reciprocal, or "inverse," of a square matrix. The inverse of
one matrix can be multiplied by another matrix in a manner analogous to
algebraic division.
The inverse (M⁻¹) of a square matrix (M) is another square matrix defined
by the relation:

M M⁻¹ = I    (B-4)

This is analogous to the scalar relation 4 × (1/4) = 1.
EXAMPLE B-1
    | 4 0 5 |
M = | 0 1 6 |
    | 3 0 4 |

and

      |  4  0  -5 |
M⁻¹ = | 18  1 -24 |
      | -3  0   4 |

Multiplying M by M⁻¹ gives the identity matrix:

| 4 0 5 |   |  4  0  -5 |   | 1 0 0 |
| 0 1 6 | × | 18  1 -24 | = | 0 1 0 |
| 3 0 4 |   | -3  0   4 |   | 0 0 1 |
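The multiplication in Example B-1 can be checked numerically in a couple of lines:

import numpy as np

M = np.array([[4, 0, 5], [0, 1, 6], [3, 0, 4]])
M_inv = np.array([[4, 0, -5], [18, 1, -24], [-3, 0, 4]])

print(M @ M_inv)         # the 3x3 identity matrix
print(np.linalg.inv(M))  # the same inverse, computed directly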
The inverse can be found by appending the identity matrix to form a
composite matrix:

| a b c | 1 0 0 |
| d e f | 0 1 0 |
| g h i | 0 0 1 |

and manipulating rows until the left side becomes the identity matrix:

| 1 0 0 | j k l |
| 0 1 0 | m n o |
| 0 0 1 | p q r |

The matrix

      | j k l |
M⁻¹ = | m n o |
      | p q r |

is then the inverse of

    | a b c |
M = | d e f |
    | g h i |
EXAMPLE B-2
Problem: Find the inverse of the matrix

    | 4 0 5 |
M = | 0 1 6 |
    | 3 0 4 |

Solution: Append the identity matrix to form the composite:

| 4 0 5 | 1 0 0 |
| 0 1 6 | 0 1 0 |
| 3 0 4 | 0 0 1 |

The objective is to manipulate matrix rows until the left side of the
composite equals the identity matrix. One good strategy is to put zeros
into the left side. As a first step, manipulate rows in order to replace
the five with a zero. Multiply row 3 by 5/4. The result is

| 4    0 5 | 1 0 0   |
| 0    1 6 | 0 1 0   |
| 15/4 0 5 | 0 0 5/4 |

Replacing row 1 with row 1 minus row 3 puts a zero in its third column:

| 1/4  0 0 | 1 0 -5/4 |
| 0    1 6 | 0 1  0   |
| 15/4 0 5 | 0 0  5/4 |

Next, replace the 15/4 with a zero. To accomplish this, use rule 2; any
row may be multiplied by a nonzero scalar (row 1 = 15 × row 1).

| 15/4 0 0 | 15 0 -75/4 |
| 0    1 6 | 0  1  0    |
| 15/4 0 5 | 0  0  5/4  |

Next, use rule 3; any row may be added to the multiple of another row
(row 3 = row 3 − row 1).

| 15/4 0 0 | 15  0 -75/4 |
| 0    1 6 | 0   1  0    |
| 0    0 5 | -15 0  80/4 |

The 6 on the left side is the next target. Multiply row 3 by 6/5.

| 15/4 0 0 | 15  0 -75/4 |
| 0    1 6 | 0   1  0    |
| 0    0 6 | -18 0  24   |

Replacing row 2 with row 2 minus row 3 clears the 6:

| 15/4 0 0 | 15  0 -75/4 |
| 0    1 0 | 18  1 -24   |
| 0    0 6 | -18 0  24   |

Finally, multiply row 1 by 4/15 and row 3 by 1/6:

| 1 0 0 |  4 0  -5 |
| 0 1 0 | 18 1 -24 |
| 0 0 1 | -3 0   4 |

The identity matrix is present in the left side of the composite matrix.
The job is finished and the inverted matrix equals:

      |  4  0  -5 |
M⁻¹ = | 18  1 -24 |
      | -3  0   4 |
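The row-manipulation procedure of Example B-2 is mechanical enough to code directly. The sketch below is a bare-bones Gauss-Jordan inversion (no pivoting, so it assumes the diagonal never becomes zero, which holds for this example):

import numpy as np

def gauss_jordan_inverse(m):
    n = m.shape[0]
    aug = np.hstack([m.astype(float), np.eye(n)])     # composite [M | I]
    for col in range(n):
        aug[col] /= aug[col, col]                     # scale the pivot row to 1
        for row in range(n):
            if row != col:
                aug[row] -= aug[row, col] * aug[col]  # clear the rest of the column
    return aug[:, n:]                                 # the right side is now M^-1

M = np.array([[4, 0, 5], [0, 1, 6], [3, 0, 4]])
print(gauss_jordan_inverse(M))   # [[ 4.  0. -5.], [18.  1. -24.], [-3.  0.  4.]]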
EXAMPLE B-3
Problem: Figure 9-1 shows the Markov model for an ideal dual-
controller system. The I − Q matrix for this model is

        | 2λ   -2λ  |
I − Q = | -μ   λ+μ  |

The equation for MTTF (Equation 9-1) is found from the N matrix, the
inverse of the I − Q matrix. Invert the I − Q matrix and find the equation
for MTTF.

Solution: Append the identity matrix to form the composite:

| 2λ  -2λ | 1 0 |
| -μ  λ+μ | 0 1 |

Multiply row 1 by (λ+μ)/(2λ):

| λ+μ -(λ+μ) | (λ+μ)/(2λ) 0 |
| -μ   λ+μ   | 0          1 |

Replace row 2 with the sum of row 1 and row 2:

| λ+μ -(λ+μ) | (λ+μ)/(2λ) 0 |
| λ    0     | (λ+μ)/(2λ) 1 |

Multiply row 1 by 1/(λ+μ) and row 2 by 1/λ:

| 1 -1 | 1/(2λ)      0   |
| 1  0 | (λ+μ)/(2λ²) 1/λ |

Replace row 2 with row 2 minus row 1, then replace row 1 with the sum
of the new row 2 and row 1:

| 1 0 | (λ+μ)/(2λ²) 1/λ |
| 0 1 | μ/(2λ²)     1/λ |

The right side of the composite is the N matrix:

    | (λ+μ)/(2λ²)  1/λ |
N = | μ/(2λ²)      1/λ |

Starting from state 0, the MTTF is the sum of the elements in the first
row of N:

MTTF = (λ+μ)/(2λ²) + 2λ/(2λ²) = (3λ+μ)/(2λ²)
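A numeric spot-check of the result, with assumed illustrative rates λ = 0.001 and μ = 0.1 per hour:

import numpy as np

lam, mu = 1e-3, 1e-1

i_minus_q = np.array([[2 * lam, -2 * lam],
                      [-mu, lam + mu]])
N = np.linalg.inv(i_minus_q)

print(N[0].sum())                          # MTTF from state 0: 51,500 hours
print((3 * lam + mu) / (2 * lam ** 2))     # closed form gives the same value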
Probability Theory

Introduction
There are many events in the world for which there is not enough knowl-
edge to accurately predict an outcome. If a penny is flipped into the air, no
one can predict with certainty that it will land heads up. If a pair of dice
is rolled, no one possesses enough knowledge to state which numbers will
appear face up on each die. These events are called random.
While random events like a coin toss do not seem related to reliability
engineering, the concept of a random event as previously defined in
Chapter 3 is the same. In reliability analysis the experiment is operating
the device for another time intervalthe possible outcomes of the experi-
ment are successful operation, failure in mode 1, failure in mode 2, etc. If a
controller module is run for another hour will it fail or not? This is much
like asking if the result of the next coin flip will be heads or tails.
Probability
Probability is a quantitative method of expressing the likelihood of an
event. A probability is assigned a number between zero and one, inclu-
sive. A probability assignment of zero means that the event is never
expected. A probability assignment of one means that the event is always
expected.
P(E) = lim (N→∞) n/N    (C-2)
Venn Diagrams
A convenient way to depict the outcomes of an experiment is through the
use of a Venn diagram. These diagrams were created by John Venn (1834-
1923), an English mathematician and cleric. They provide visual represen-
tation of data sets, including experimental outcomes. The diagrams are
drawn by using the area of a rectangle to represent all possible outcomes;
this area is known as the sample space. Any particular outcome is shown by
using a portion of the area within the rectangle.
A fair coin is defined as a coin that is equally likely to give a heads result
or a tails result. For a fair coin flip, the Venn diagram of possible outcomes
is shown in Figure C-1. There are two expected outcomes: heads and tails.
Each has a well-known probability of one-half. The diagram shows the
outcomes, with each allocated area in proportion to its probability.
For the toss of a fair pair of dice (fair meaning that every number is
equally likely to come up on each die), the possible outcomes are shown in
Figure C-2. The outcomes do not occupy the same area on the diagram.
Some outcomes are more likely than others; these occupy more area. The
area occupied by each outcome is proportional to its probability.
Complementary sets are easily shown on Venn diagrams. Since the diagram
represents the entire sample space, all area not enclosed within an event is
the complement of the event set. In Figure C-5, the set A is represented by
a circle. Its complement is set B, represented by the remainder of the dia-
gram.
Combining Probabilities
Certain rules help to combine probabilities. Combinations of events are common in the field of reliability evaluation. System failures often occur only when certain combinations of events happen during certain times. When two events A and B are independent (the outcome of one does not affect the other), the probability that both occur is the product of the individual probabilities:

$$P(A \cap B) = P(A) \times P(B) \tag{C-3}$$
EXAMPLE C-1
Problem: Two fair coins are flipped into the air. What is the probability
that both coins will land with heads showing?
Solution: Each coin toss has only two possible outcomes: heads or
tails. Each outcome has a probability of one-half. The coin tosses are
independent. Therefore,
$$P(H_1 \cap H_2) = P(H_1) \times P(H_2) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$
EXAMPLE C-2
Problem: A pair of fair dice is rolled. What is the probability of getting "snake eyes" (a single dot showing on each die)?
Solution: The outcome of one die does not affect the outcome of the other die. Therefore, the events are independent. The probability of getting one dot can be obtained by noting that there are six sides on the die and that each side is equally likely. The probability of getting one dot is one-sixth (1/6). The probability of getting snake eyes is represented as:

$$P(1, 1) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$
Check the area occupied by the 2 result on Figure C-2. Is that area
equal to 1/36?
EXAMPLE C-3
Problem: A controller fails only if the input power fails and the
controller battery also fails. Assume that these factors are
independent. The probability of input power failure is 0.0001. The
probability of battery failure is 0.01. What is the probability of
controller failure?
Solution: Since input power and battery failures are independent, the probability of both events is given by Equation C-3:

$$P(\text{controller failure}) = 0.0001 \times 0.01 = 0.000001$$
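A few lines of code confirm the arithmetic; the variable names are hypothetical, but the probabilities come from the problem statement:

```python
p_power_fail = 0.0001    # probability that input power fails
p_battery_fail = 0.01    # probability that the battery fails

# Independent events: joint probability is the product (Equation C-3)
p_controller_fail = p_power_fail * p_battery_fail
print(p_controller_fail)  # 1e-06
```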
Probability Summation
If the probability of getting a result from set A equals 0.2 and the probabil-
ity of getting a result from set B equals 0.3, what is the probability of get-
ting a result from either set A or set B?
It would be natural to assume that the answer is 0.5, the sum of the above
probabilities, but that answer is not always correct. Look at the Venn dia-
gram in Figure C-8. If the area of set A (6/36) is added to the area of set B
(6/36), the answer (12/36) is too large. (The answer should be 11/36.)
Since there is an intersection between sets A and B, the area of the intersection has been counted twice. When summing probabilities, the intersections must be subtracted. Thus, the probability of the union of event sets A and B is given by:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{C-4}$$

When the events are mutually exclusive, the intersection is empty and the sum is exact:

$$P(A \cup B) = P(A) + P(B) \tag{C-5}$$
EXAMPLE C-4
EXAMPLE C-5
EXAMPLE C-6
Problem: A pair of fair dice is rolled. What is the probability that both dice show even numbers?
Solution: On each die are six numbers. Three of the numbers are odd (1, 3, 5) and three of the numbers are even (2, 4, 6). All numbers are mutually exclusive. Equation C-5 gives the probability of getting an even number on one die:

$$P(\text{Even}) = P(2, 4, 6) = P(2) + P(4) + P(6) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$$
The dice are independent, so the probability that both dice show even numbers is:

$$P(\text{Even, Even}) = P(\text{Set A Even}) \times P(\text{Set B Even}) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$
EXAMPLE C-7
Problem: A pair of fair dice is rolled. What is the probability that at least one die shows a two?
Solution: Let set A be the outcomes in which the first die shows a two and set B be the outcomes in which the second die shows a two. Using Equation C-4:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{1}{6} + \frac{1}{6} - \frac{1}{36} = \frac{11}{36}$$
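The 11/36 result can also be verified by brute-force enumeration of the 36 equally likely outcomes; a sketch assuming the reading of Example C-7 given above:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 (die1, die2) pairs
a = [o for o in outcomes if o[0] == 2]           # set A: first die shows a two
b = [o for o in outcomes if o[1] == 2]           # set B: second die shows a two
both = [o for o in outcomes if o == (2, 2)]      # the intersection

# Inclusion-exclusion (Equation C-4): P(A or B) = P(A) + P(B) - P(A and B)
p_union = (len(a) + len(b) - len(both)) / len(outcomes)
print(p_union)  # 11/36 = 0.3055...
```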
Conditional Probability
It is often necessary to calculate the probability of some event under specific circumstances. The probability of event A, given that event B has occurred, may need to be calculated. Such a probability is called a conditional probability. The situation can be envisioned by examining the Venn diagram in Figure C-9. Suppose it is known that event B has occurred. This means that only the state space within the area of circle B needs to be examined. This is a substantially reduced area! The desired probability is the area of circle A within circle B, divided by the area of circle B, expressed by:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{C-6}$$
EXAMPLE C-8
Problem: A pair of fair dice is rolled. Given that the first die shows a two, what is the probability that both dice show a two?
Solution: The probability of {2,2}, given that one die has a two, is given by Equation C-6:

$$P(2, 2 \mid \text{first die} = 2) = \frac{1/36}{1/6} = \frac{1}{6}$$
In this case, the answer is intuitive since the outcome of each die is independent. The problem could have been solved by noting that

$$P(A \cap B) = P(A) \times P(B) = \frac{1}{6} \times \frac{1}{6}$$

so that

$$P(A \mid B) = \frac{P(A) \times P(B)}{P(B)} = P(A) \tag{C-7}$$

In the example, the result could have been calculated by merely knowing the probability of getting a two on the second die.
Rearranging Equation C-6 gives the multiplication rule:

$$P(A \cap B) = P(A \mid B) \times P(B) \tag{C-8}$$

This states that the intersection of events A and B can be obtained by multiplying the probability of A, given B, times the probability of B. When the statistics are kept in a conditional format, this equation can be useful.
EXAMPLE C-9
Problem: A pair of fair dice is rolled. Given that exactly one die shows a two, what is the probability that the sum of the dice is seven?
Solution: There are only two ways to get a sum of seven, given that one die has a two. Those two combinations are {2,5} and {5,2}. There are 10 combinations that show a two on exactly one die. These sets are {2,1}, {2,3}, {2,4}, {2,5}, {2,6}, {1,2}, {3,2}, {4,2}, {5,2}, and {6,2}. Using Equation C-6:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{2/36}{10/36} = \frac{2}{10}$$
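Equation C-6 can be checked the same way by enumerating outcomes; a sketch of Example C-9:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
# Event B: exactly one die shows a two (10 of the 36 outcomes)
b = [o for o in outcomes if (o[0] == 2) != (o[1] == 2)]
# Event A and B: the sum is seven as well -- only (2,5) and (5,2)
a_and_b = [o for o in b if sum(o) == 7]

print(len(a_and_b) / len(b))  # (2/36)/(10/36) = 2/10 = 0.2
```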
Bayes Rule
Consider an event A. The state space in which it exists is divided into two mutually exclusive sections, B and B′ (Figure C-11). Event A can be written as:

$$A = (A \cap B) \cup (A \cap B') \tag{C-9}$$

Since the two pieces are mutually exclusive, the probabilities add:

$$P(A) = P(A \cap B) + P(A \cap B') \tag{C-10}$$

Applying Equation C-8 to each term expresses this with conditional probabilities:

$$P(A) = P(A \mid B)P(B) + P(A \mid B')P(B') \tag{C-11}$$

This states that the probability of event A equals the conditional probability of A, given that B has occurred, weighted by the probability of B, plus the conditional probability of A, given that B has not occurred, weighted by the probability of B′. This is known as Bayes rule. It is widely used in many aspects of reliability engineering.
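A numeric sketch of Equation C-11 with assumed, purely illustrative numbers: suppose a device sees a high-stress condition (event B) with probability 0.1, and its failure probability depends on the stress level:

```python
p_b = 0.1                # P(B): high-stress condition (assumed)
p_a_given_b = 0.05       # P(A|B): failure probability under high stress (assumed)
p_a_given_not_b = 0.001  # P(A|B'): failure probability otherwise (assumed)

# Total probability (Equation C-11): P(A) = P(A|B)P(B) + P(A|B')P(B')
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
print(p_a)  # 0.0059
```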
EXAMPLE C-10
The probabilities of failure for each shift are calculated by dividing the number of failures during each shift by the number of hours in each shift. Substituting the numbers into Equation C-12 gives the overall probability of failure:

P(failure) = 0.001139
EXAMPLE C-11
EXAMPLE C-12
Problem: How many two-letter character strings can be made from the letters A, B, and C if no letter is repeated?
Solution: Using the basic counting rule: 3 × 2 = 6. This can be verified by creating all the character strings. Starting with the letter A, they are {A,B}, {A,C}, {B,A}, {B,C}, {C,A}, and {C,B}.
EXAMPLE C-13
Problem: A controller is ordered by selecting one of three models, one of four communications options, one of two I/O options, and one of two memory options. How many controller variations exist?
Solution: Four steps are required to select the controller. In the first step, one of three models is selected. One of four communications options is chosen in the second step. One of two I/O options is chosen in the third step. One of two memory options is chosen in the fourth step. Using the first counting rule, the number of variations is:

3 × 4 × 2 × 2 = 48
EXAMPLE C-14
Problem: How many four-letter sequences can be made from the letters A, B, C, and D if each letter is used only once?
Solution: Using the basic counting rule: 4 × 3 × 2 × 1 = 24
Permutations
An ordered arrangement of objects without repetition is known as a per-
mutation. The four-letter sequences from Example C-14 are permutations.
The number of permutations of n objects is n! (n! is pronounced "n factorial" and is the mathematical notation for the product 1 × 2 × 3 × ⋯ × (n − 1) × n).
EXAMPLE C-15
Problem: In how many different orders can six objects be arranged?
Solution: The number of permutations of six objects is 6! = 720.
$$P(n, r) = n(n-1)(n-2)\cdots(n-r+1) \tag{C-13}$$

If this expression is multiplied by

$$\frac{(n-r)(n-r-1)\cdots 1}{(n-r)(n-r-1)\cdots 1}$$

(which equals one), then

$$P(n, r) = \frac{n(n-1)(n-2)\cdots(n-r+1)(n-r)(n-r-1)\cdots 1}{(n-r)(n-r-1)\cdots 1} \tag{C-14}$$

The numerator is n!. The denominator is (n − r)!. Therefore:

$$P(n, r) = \frac{n!}{(n-r)!} \tag{C-15}$$
Equation C-15 is used to determine permutations, the number of ways
that r objects from a set of n objects can be arranged in order without repe-
tition.
EXAMPLE C-16
Problem: How many permutations of four objects taken two at a time are there?
Solution: Using Equation C-15, P(4, 2) = 4!/(4 − 2)! = 24/2 = 12. This can be verified by listing the permutations: AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, and DC.
EXAMPLE C-17
Problem: How many permutations of six objects taken four at a time are there?
Solution: Using Equation C-15, P(6, 4) = 6!/(6 − 4)! = 720/2 = 360. This can be verified by using the basic counting rule. The first step has six possibilities. The second step has five. The third step has four possibilities, and the fourth step has three. The basic counting rule tells us:
$$P(6, 4) = 6 \times 5 \times 4 \times 3 = 360$$
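Python's standard library computes these counts directly; a quick check of Equation C-15 against Example C-17:

```python
import math

# P(n, r) = n! / (n - r)!
print(math.perm(6, 4))                             # 360
print(math.factorial(6) // math.factorial(6 - 4))  # 360, from Equation C-15
```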
Combinations
Combinations are groupings of elements in which order does not count. Since order does not count, the number of combinations of n objects taken r at a time is always less than the number of permutations.
Consider the number of permutations of three letters (3! = 6). They are:
ABC, ACB, BAC, BCA, CAB, and CBA. If order does not count (three
objects are taken three at a time), all these arrangements are the same.
There is only one combination. The number has been reduced by a factor
of 3!.
$$C(n, r) = \frac{n!}{r!(n-r)!} \tag{C-16}$$
Comparing this formula with Equation C-15, note that the number of per-
mutations is reduced by a factor r! to obtain the number of combinations.
EXAMPLE C-18
Problem: How many combinations of six objects taken four at a time are there?
Solution: Using Equation C-16:

$$C(6, 4) = \frac{6!}{4!(6-4)!} = \frac{720}{24 \times 2} = 15$$
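The same check works for combinations (Equation C-16 and Example C-18):

```python
import math

# C(n, r) = n! / (r! (n - r)!)
print(math.comb(6, 4))                                               # 15
print(math.factorial(6) // (math.factorial(4) * math.factorial(2)))  # 15
```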
Exercises
C.1 Three fair coins are tossed into the air. What is the probability that
all three will land heads up?
C.2 A control loop has a temperature transmitter, a controller, and a
valve. All three devices must operate successfully for the loop to
operate successfully. The loop will be operated for one year before
the process is shut down and overhauled. The temperature trans-
mitter has a probability of failure during the next year of 0.01. The
controller has a probability of failure during the next year of 0.005.
The valve has a probability of failure during the next year of 0.05.
Assume that temperature transmitter, controller, and valve fail-
ures are independent but not mutually exclusive. What is the prob-
ability of failure for the control loop?
C.3 Three fair coins are tossed into the air. What is the probability that
at least one coin will land heads up?
C.4 A pair of fair dice is rolled. If the result is a six or an eight, you hit
Park Place or Boardwalk and go broke. If the result is nine or more,
you pass GO and collect $200.00. What is the probability of going
broke on the next turn? What is the probability of passing GO?
C.5 A control system has four temperature-control loops. The system
operates if three or more loops operate. The system fails if fewer
than three loops operate. How many combinations of loop failures
cause system failure?
C.10 A control system has three controllers. The system fails if two
controllers fail or if all three controllers fail. A controller is repair-
able and has a steady-state probability of success of 0.95. What is
the probability that two controllers are failed and the other control-
ler is successful? How many combinations of two failed controllers
and one successful controller exist? What other combinations
result in system failure? What is the overall probability of system
failure?
C.11 Using the Venn diagram of Figure C-7, estimate the probabilities of
the various failure sources.
C.12 During a year, the probability of getting a dangerous condition in
an industrial boiler is 0.00001. The probability of a safety instru-
mented protection system failing to respond to the dangerous
demand is 0.000001. If there is a dangerous condition AND the
protection system does not respond, there will be a boiler explo-
sion. What is the probability of a boiler explosion?
C.13 The probability of an explosion in an industrial process is 0.00002.
The insurance underwriter wants a safety instrumented system
designed that will reduce the probability of an explosion to
0.0000001. What is the maximum probability of failing danger-
ously allowed in the safety instrumented system?
Answers to Exercises
C.1 Each coin must land heads-up. The probability of that event is 1/2. The combination of all three heads events is 1/2 × 1/2 × 1/2 = 1/8.
C.2 P(control loop success) = P(transmitter success) × P(controller success) × P(valve success) = (1 − 0.01)(1 − 0.005)(1 − 0.05) = 0.99 × 0.995 × 0.95 = 0.9358. P(control loop failure) = 1 − 0.9358 = 0.0642. Note that an approximation could be obtained by adding the failure probabilities. The approximation would be: P(approx. control loop failure) = 0.01 + 0.005 + 0.05 = 0.065. This probability summation method is based on Equation C-5 and is not exact because the failures are not mutually exclusive. The exact method is done by expanding Equation C-4. While the approximate method is not exact, it is usually faster and produces a conservative result.
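The exact and approximate methods are easy to compare in a few lines, using the failure probabilities from the exercise:

```python
p_fail = [0.01, 0.005, 0.05]  # transmitter, controller, valve

p_success = 1.0
for p in p_fail:
    p_success *= 1 - p        # all three devices must succeed
exact = 1 - p_success         # exact loop failure probability

approx = sum(p_fail)          # conservative summation approximation

print(round(exact, 4), approx)  # 0.0642 versus 0.065
```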
C.3 P(at least one heads up) = 1 - P(no heads up) = 1 - 1/8 = 7/8. Alter-
natively, the problem could be solved by creating a list of all com-
binations of three coin toss outcomes. There will be eight possible
combinations. Each combination will be mutually exclusive. It will
be seen that seven of the eight combinations will have at least one
heads up.
C.4 P(going broke) = P(6 or 8) = P(6) + P(8) since outcomes are mutu-
ally exclusive. P(going broke) = 5/36 + 5/36 = 10/36. P(passing
GO) = P(9 or more) = P(9) + P(10) + P(11) + P(12) = 4/36 + 3/36 +
2/36 + 1/36 = 10/36.
C.5 All combinations of two loops operating, one loop operating, and zero loops operating represent system failure. Combinations of three or four loops operating represent system success. Combinations of one loop operating are 4!/(1!(4 − 1)!) = 4. Combinations of two loops operating are 4!/(2!(4 − 2)!) = 6. Combinations of zero loops operating (all four failed) are 4!/(0!(4 − 0)!) = 1. There is a total of 11 combinations of successful/failed loops that represent system failure.
C.6 The number of combinations is given by the basic counting rule: 3 × 2 × 1 = 6.
C.7 The number of combinations of letters is given by the basic counting rule. In this case: 5 × 4 × 3 × 2 × 1 = 120.
C.8 Using the formula for combinations: 3!/(2!(3 − 2)!) = 3.
C.9 P(one successful controller and one failed controller) = 2 × 0.95 × (1 − 0.95) = 0.095.
Bibliography
1. Johnsonbaugh, R. Essential Discrete Mathematics. NY: Macmillan
Publishing Company, 1987.
Appendix D
Test Data
Reliability parameters can be calculated from life test data. Table 4-2
shows accelerated reliability life test data for a set of fifty modules. A
number of variables are used to describe this data. The original number of
modules in the test is denoted by the variable No. The number of modules
surviving after each time period t is denoted by the variable Ns. The
cumulative number of modules that have failed is denoted by the variable
Nf. The reliability function can be calculated as follows:

$$R(t) = \frac{N_s(t)}{N_o} \tag{D-1}$$

The unreliability function is the fraction of the original population that has failed:

$$F(t) = \frac{N_f(t)}{N_o} \tag{D-2}$$

The probability density function is approximated by the change in unreliability over each time interval Δt:

$$f(t) = \frac{F(t_n) - F(t_{n-1})}{\Delta t}$$

This equals

$$f(t) = \frac{1}{N_o} \times \frac{N_f(t_n) - N_f(t_{n-1})}{\Delta t} \tag{D-3}$$
Since the failure rate equals the PDF divided by the reliability function,

$$\lambda(t) = \frac{f(t)}{R(t)} = \frac{1}{R(t)} \times f(t) = \frac{N_o}{N_s(t_{n-1})} \times \frac{1}{N_o} \times \frac{N_f(t_n) - N_f(t_{n-1})}{\Delta t}$$

$$\lambda(t) = \frac{1}{N_s(t_{n-1})} \times \frac{N_f(t_n) - N_f(t_{n-1})}{\Delta t} \tag{D-4}$$
Using the data from Table 4-2, at the end of the first week forty-one modules survived and nine modules failed. The calculations for week one are shown below.

$$R(t_1) = \frac{41}{50} = 0.82$$

$$F(t_1) = \frac{9}{50} = 0.18$$

$$f(t_1) = \frac{1}{50} \times \frac{9 - 0}{24 \times 7} = 0.00107 \text{ failures/hr}$$

$$\lambda(t_1) = \frac{1}{50} \times \frac{9 - 0}{24 \times 7} = 0.00107 \text{ failures/hr}$$
For week two, thirty-six modules survived and fourteen had failed cumulatively:

$$R(t_2) = \frac{36}{50} = 0.72$$

$$F(t_2) = \frac{14}{50} = 0.28$$

$$f(t_2) = \frac{1}{50} \times \frac{14 - 9}{24 \times 7} = 0.00059 \text{ failures/hr}$$

$$\lambda(t_2) = \frac{1}{41} \times \frac{14 - 9}{24 \times 7} = 0.00072 \text{ failures/hr}$$
Table D-1 shows the calculations for the first ten weeks.
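These calculations are easy to script. The sketch below assumes weekly survivor counts shaped like those in Table 4-2 (the week-three count is assumed for illustration), with 168 hours per week:

```python
No = 50                # modules at the start of the test
Ns = [50, 41, 36, 31]  # survivors at t0, t1, t2, t3 (t3 count assumed)
dt = 24 * 7            # hours per weekly interval

for n in range(1, len(Ns)):
    Nf_now, Nf_prev = No - Ns[n], No - Ns[n - 1]
    R = Ns[n] / No                               # Equation D-1
    F = Nf_now / No                              # Equation D-2
    f = (Nf_now - Nf_prev) / (No * dt)           # Equation D-3
    lam = (Nf_now - Nf_prev) / (Ns[n - 1] * dt)  # Equation D-4
    print(f"week {n}: R={R:.2f} F={F:.2f} f={f:.5f} lambda={lam:.5f}")
```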
The failure rate calculated from Table D-2 is shown in Figure D-5.
Compare this to Figure D-4, which was calculated from summary
(censored) data. The censored data produce a noisier plot. In general, the finer the failure-time resolution of the data, the better the analysis.
Appendix E
Continuous-Time Markov Modeling
When the size of the time increment in a discrete Markov model is reduced, accuracy is increased as approximations are reduced. Taken to the limit, the time increment approaches zero. The limit as the time increment (Δt) approaches zero is labeled dt. At the limit, we have achieved continuous time.

$$\lim_{\Delta t \to 0} \Delta t = dt$$

Assume the model starts in state 0 at time t. The model will be in state 0 during the next instant (time = t + Δt) only if it stays in state 0. This can be expressed mathematically as:

$$S_0(t + \Delta t) = S_0(t)(1 - \lambda \Delta t)$$
Figure E-1. Markov Model for Single Non-Repairable Component
$$S_0(t + \Delta t) - S_0(t) = -\lambda S_0(t) \Delta t$$

$$\frac{S_0(t + \Delta t) - S_0(t)}{\Delta t} = -\lambda S_0(t) \tag{E-1}$$

The left side of Equation E-1 is the derivative with respect to time. Taking the limit as Δt approaches zero results in:

$$\frac{dS_0(t)}{dt} = -\lambda S_0(t) \tag{E-2}$$

Similarly, for the failed state:

$$\frac{dS_1(t)}{dt} = \lambda S_0(t) \tag{E-3}$$
Equations E-2 and E-3 are first-order differential equations with constant coefficients. One of the easiest ways to solve such equations is to use a Laplace Transform to convert from the time domain (t) to the frequency domain (s). Taking the Laplace Transform:

$$sS_0(s) - S_0(0) = -\lambda S_0(s)$$

and

$$sS_1(s) - S_1(0) = \lambda S_0(s)$$

Since the system starts in state 0, substitute S0(0) = 1 and S1(0) = 0. This results in:

$$sS_0(s) - 1 = -\lambda S_0(s) \tag{E-4}$$

and

$$sS_1(s) = \lambda S_0(s) \tag{E-5}$$

Rearranging Equation E-4:

$$(s + \lambda)S_0(s) = 1$$

Therefore:

$$S_0(s) = \frac{1}{s + \lambda} \tag{E-6}$$

Substituting Equation E-6 into Equation E-5:

$$sS_1(s) = \frac{\lambda}{s + \lambda}$$

$$S_1(s) = \frac{\lambda}{s(s + \lambda)}$$
Taking the inverse Laplace Transform of each expression:

$$S_0(t) = e^{-\lambda t} \tag{E-7}$$

and

$$S_1(t) = 1 - e^{-\lambda t} \tag{E-8}$$
Since state 0 is the success state, reliability is equal to S0(t) and is given by
Equation E-7. Unreliability is equal to S1(t) and is given by Equation E-8.
This result is identical to the result obtained in Chapter 4 (Equations 4-16
and 4-17) when a component has an exponential probability of failure.
Thus, the Markov model solution verifies the clear relationship between
the constant failure rate and the exponential probability of failure over a
time period.
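The agreement can be confirmed numerically by stepping the discrete model with a small Δt and comparing the result to Equation E-7; the failure rate below is an assumed illustrative value:

```python
import math

lam = 1e-4       # assumed failure rate, per hour
dt = 0.1         # time step, hours
steps = 100_000  # 10,000 hours total

s0 = 1.0         # model starts in state 0
for _ in range(steps):
    s0 -= lam * s0 * dt  # one discrete step of Equation E-2

print(s0, math.exp(-lam * steps * dt))  # both approximately e^-1 = 0.3679
```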
For a single repairable component, the model can also return from state 1 to state 0 by repair at rate μ. The model will be in state 0 at the next instant if it stays in state 0 or if it is repaired from state 1:

$$S_0(t + \Delta t) = S_0(t)(1 - \lambda \Delta t) + S_1(t)(\mu \Delta t)$$

This can be rearranged as:

$$S_0(t + \Delta t) - S_0(t) = [-\lambda S_0(t) + \mu S_1(t)] \Delta t$$

$$\frac{dS_0(t)}{dt} = -\lambda S_0(t) + \mu S_1(t) \tag{E-9}$$
In a similar manner:

$$\frac{dS_1(t)}{dt} = \lambda S_0(t) - \mu S_1(t) \tag{E-10}$$
Equations E-9 and E-10 are first-order differential equations. Again, using the Laplace Transform solution method:

$$sS_0(s) - S_0(0) = -\lambda S_0(s) + \mu S_1(s)$$

Rearranging:

$$(s + \lambda)S_0(s) = \mu S_1(s) + S_0(0)$$

$$S_0(s) = \frac{\mu}{s + \lambda} S_1(s) + \frac{1}{s + \lambda} S_0(0) \tag{E-11}$$

In a similar manner:

$$S_1(s) = \frac{\lambda}{s + \mu} S_0(s) + \frac{1}{s + \mu} S_1(0) \tag{E-12}$$

Substituting Equation E-11 into Equation E-12:

$$S_1(s) = \frac{\lambda}{s + \mu} \cdot \frac{\mu}{s + \lambda} S_1(s) + \frac{\lambda}{s + \mu} \cdot \frac{1}{s + \lambda} S_0(0) + \frac{1}{s + \mu} S_1(0)$$

Collecting the S1(s) terms on the left:

$$\left[1 - \frac{\lambda \mu}{(s + \lambda)(s + \mu)}\right] S_1(s) = \frac{1}{s + \mu} S_1(0) + \frac{\lambda}{(s + \lambda)(s + \mu)} S_0(0)$$

Creating a common denominator for the left half of the equation yields:

$$\frac{(s + \lambda)(s + \mu) - \lambda \mu}{(s + \lambda)(s + \mu)} S_1(s) = \frac{1}{s + \mu} S_1(0) + \frac{\lambda}{(s + \lambda)(s + \mu)} S_0(0)$$

If both sides of the equation are divided by the first term, the S1(s) term is isolated:

$$S_1(s) = \frac{(s + \lambda)(s + \mu)}{(s + \lambda)(s + \mu) - \lambda \mu} \left[\frac{1}{s + \mu} S_1(0) + \frac{\lambda}{(s + \lambda)(s + \mu)} S_0(0)\right]$$

Multiplying out the denominator and canceling equal terms (the denominator reduces to s(s + λ + μ)):

$$S_1(s) = \frac{1}{s(s + \lambda + \mu)} [(s + \lambda)S_1(0) + \lambda S_0(0)] \tag{E-13}$$
To move further with the solution, we must arrange Equation E-13 into a form that will allow an inverse transform. A partial fraction expansion of S1(s) is used, where:

$$S_1(s) = \frac{A}{s} + \frac{B}{s + \lambda + \mu} \tag{E-14}$$

$$\frac{A}{s} + \frac{B}{s + \lambda + \mu} = \frac{1}{s(s + \lambda + \mu)} [(s + \lambda)S_1(0) + \lambda S_0(0)] \tag{E-15}$$

Multiplying both sides by s(s + λ + μ):

$$(s + \lambda)S_1(0) + \lambda S_0(0) = A(s + \lambda + \mu) + Bs$$
This relation holds true for all values of s. Therefore, to solve for A and B, we should pick a value of s that will simplify the algebra as much as possible. To solve for A, a value of s = 0 is the best choice. At s = 0,

$$\lambda S_1(0) + \lambda S_0(0) = A(\lambda + \mu)$$

Therefore:

$$A = \frac{\lambda}{\lambda + \mu} [S_1(0) + S_0(0)] \tag{E-16}$$

To solve for B, the best choice is

$$s = -(\lambda + \mu)$$

Substituting for s:

$$-\mu S_1(0) + \lambda S_0(0) = -B(\lambda + \mu)$$

Rearranging,

$$B = \frac{1}{\lambda + \mu} [\mu S_1(0) - \lambda S_0(0)] \tag{E-17}$$
Substituting A and B into Equation E-14:

$$S_1(s) = \frac{\lambda}{\lambda + \mu} \cdot \frac{1}{s} [S_1(0) + S_0(0)] + \frac{1}{\lambda + \mu} \cdot \frac{1}{s + \lambda + \mu} [\mu S_1(0) - \lambda S_0(0)]$$

In a similar manner:

$$S_0(s) = \frac{\mu}{\lambda + \mu} \cdot \frac{1}{s} [S_1(0) + S_0(0)] + \frac{1}{\lambda + \mu} \cdot \frac{1}{s + \lambda + \mu} [\lambda S_0(0) - \mu S_1(0)]$$

Taking the inverse Laplace Transform of each expression:

$$S_0(t) = \frac{\mu}{\lambda + \mu} [S_0(0) + S_1(0)] + \frac{e^{-(\lambda + \mu)t}}{\lambda + \mu} [\lambda S_0(0) - \mu S_1(0)]$$

and

$$S_1(t) = \frac{\lambda}{\lambda + \mu} [S_0(0) + S_1(0)] + \frac{e^{-(\lambda + \mu)t}}{\lambda + \mu} [\mu S_1(0) - \lambda S_0(0)]$$

Since the system starts in state 0,

$$S_0(0) = 1 \quad \text{and} \quad S_1(0) = 0$$

Substituting,

$$S_0(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda e^{-(\lambda + \mu)t}}{\lambda + \mu} \tag{E-18}$$

and

$$S_1(t) = \frac{\lambda}{\lambda + \mu} - \frac{\lambda e^{-(\lambda + \mu)t}}{\lambda + \mu} \tag{E-19}$$
As t approaches infinity, the exponential terms go to zero, leaving:

$$S_0(\infty) = \frac{\mu}{\lambda + \mu} \tag{E-20}$$

and

$$S_1(\infty) = \frac{\lambda}{\lambda + \mu} \tag{E-21}$$

The limiting state probability is the expected result at infinite time. Thus, Equations E-20 and E-21 provide this information.
These methods can be used to solve for analytical state probabilities for
more complex models; however, the mathematics become quite complex
for realistic models of several states. The use of numerical techniques in
combination with a discrete-time Markov model is rapidly becoming the
method of choice when time-dependent state probabilities are needed.
The limiting state probabilities can also be found directly from the discrete transition matrix P. The limiting state probability vector SL satisfies:

$$S^L P = S^L \tag{E-22}$$

For the single repairable component,

$$[S_0^L \;\; S_1^L] \begin{bmatrix} 1 - \lambda & \lambda \\ \mu & 1 - \mu \end{bmatrix} = [S_0^L \;\; S_1^L]$$

This yields:

$$(1 - \lambda)S_0^L + \mu S_1^L = S_0^L$$

and

$$\lambda S_0^L + (1 - \mu)S_1^L = S_1^L$$

Either equation reduces to:

$$S_1^L = \frac{\lambda}{\mu} S_0^L$$

Substituting this into the requirement that the probabilities sum to one,

$$S_0^L + S_1^L = 1$$

yields:

$$S_0^L + \frac{\lambda}{\mu} S_0^L = 1$$

Solving for the limiting state probabilities, the steady-state availability and unavailability are:

$$A(\infty) = S_0^L = \frac{\mu}{\lambda + \mu} \tag{E-23}$$

$$U(\infty) = S_1^L = 1 - S_0^L = \frac{\lambda}{\lambda + \mu} \tag{E-24}$$

Substituting

$$MTTF = \frac{1}{\lambda} \quad \text{and} \quad MTTR = \frac{1}{\mu}$$

the familiar expression

$$A(\infty) = \frac{MTTF}{MTTF + MTTR} \tag{E-25}$$

is obtained.
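Equation E-25 in a few lines, with assumed values for MTTF and MTTR:

```python
mttf = 8760.0  # assumed mean time to failure, hours (about one year)
mttr = 8.0     # assumed mean time to restore, hours

lam, mu = 1 / mttf, 1 / mttr

# Equations E-23 and E-25 give the same steady-state availability
print(mu / (lam + mu))       # 0.99908...
print(mttf / (mttf + mttr))  # identical by substitution
```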
Figure E-3. Markov Model for Single Component with Multiple Failure Modes

Again, assume that the model starts in state 0. With a safe failure rate λS and a dangerous failure rate λD (Figure E-3), the model will be in state 0 in the next time instant only if it stays in state 0. This can be expressed mathematically as:

$$S_0(t + \Delta t) = S_0(t)(1 - \lambda^S \Delta t - \lambda^D \Delta t)$$

$$S_0(t + \Delta t) - S_0(t) = -\lambda^S S_0(t) \Delta t - \lambda^D S_0(t) \Delta t$$

$$\frac{S_0(t + \Delta t) - S_0(t)}{\Delta t} = -\lambda^S S_0(t) - \lambda^D S_0(t) \tag{E-26}$$
The left side of Equation E-26 is the derivative with respect to time. Taking the limit as Δt approaches zero results in:

$$\frac{dS_0(t)}{dt} = -\lambda^S S_0(t) - \lambda^D S_0(t) \tag{E-27}$$

$$\frac{dS_1(t)}{dt} = \lambda^S S_0(t) \tag{E-28}$$

and

$$\frac{dS_2(t)}{dt} = \lambda^D S_0(t) \tag{E-29}$$
Equations E-27, E-28, and E-29 are first-order differential equations with constant coefficients. Using a Laplace Transform to convert from the time domain (t) to the frequency domain (s):

$$sS_0(s) - S_0(0) = -\lambda^S S_0(s) - \lambda^D S_0(s)$$

$$sS_1(s) - S_1(0) = \lambda^S S_0(s)$$

and

$$sS_2(s) - S_2(0) = \lambda^D S_0(s)$$

For the initial conditions S0(0) = 1, S1(0) = 0, and S2(0) = 0, the equations reduce to:

$$sS_0(s) - 1 = -\lambda^S S_0(s) - \lambda^D S_0(s) \tag{E-30}$$

$$sS_1(s) = \lambda^S S_0(s) \tag{E-31}$$

and

$$sS_2(s) = \lambda^D S_0(s) \tag{E-32}$$

Rearranging Equation E-30:

$$(s + \lambda^S + \lambda^D)S_0(s) = 1$$

Therefore:

$$S_0(s) = \frac{1}{s + \lambda^S + \lambda^D} \tag{E-33}$$

Substituting Equation E-33 into Equations E-31 and E-32 and solving:

$$sS_1(s) = \frac{\lambda^S}{s + \lambda^S + \lambda^D}$$

$$S_1(s) = \frac{\lambda^S}{s(s + \lambda^S + \lambda^D)} \tag{E-34}$$

and

$$S_2(s) = \frac{\lambda^D}{s(s + \lambda^S + \lambda^D)} \tag{E-35}$$
Taking the inverse transform of Equation E-33:

$$S_0(t) = e^{-(\lambda^S + \lambda^D)t} = R(t) \tag{E-36}$$
A partial fraction expansion is used for S1(s):

$$S_1(s) = \frac{A}{s} + \frac{B}{s + \lambda^S + \lambda^D} \tag{E-37}$$

$$\frac{A}{s} + \frac{B}{s + \lambda^S + \lambda^D} = \frac{\lambda^S}{s(s + \lambda^S + \lambda^D)}$$

Multiplying both sides by s(s + λS + λD):

$$\lambda^S = A(s + \lambda^S + \lambda^D) + Bs$$
This relation holds true for all values of s. Therefore, to solve for A and B, we should pick a value of s that will simplify the algebra as much as possible. To solve for A, a value of s = 0 is the best choice. At s = 0,

$$\lambda^S = A(\lambda^S + \lambda^D)$$

Therefore:

$$A = \frac{\lambda^S}{\lambda^S + \lambda^D} \tag{E-38}$$

To solve for B, the best choice is

$$s = -(\lambda^S + \lambda^D)$$

Substituting for s:

$$\lambda^S = -B(\lambda^S + \lambda^D)$$

Rearranging,

$$B = -\frac{\lambda^S}{\lambda^S + \lambda^D} \tag{E-39}$$
Substituting A and B into Equation E-37:

$$S_1(s) = \frac{\lambda^S}{s(\lambda^S + \lambda^D)} - \frac{\lambda^S}{(\lambda^S + \lambda^D)(s + \lambda^S + \lambda^D)}$$

Taking the inverse transform:

$$S_1(t) = \frac{\lambda^S}{\lambda^S + \lambda^D} - \frac{\lambda^S e^{-(\lambda^S + \lambda^D)t}}{\lambda^S + \lambda^D}$$

$$S_1(t) = \frac{\lambda^S}{\lambda^S + \lambda^D} \left(1 - e^{-(\lambda^S + \lambda^D)t}\right) \tag{E-40}$$

Similarly:

$$S_2(t) = \frac{\lambda^D}{\lambda^S + \lambda^D} \left(1 - e^{-(\lambda^S + \lambda^D)t}\right) \tag{E-41}$$
Figure E-4. Time Dependent Probabilities for Single Component with Multiple Failure Modes
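A short script evaluates Equations E-36, E-40, and E-41 over time, using assumed safe and dangerous failure rates:

```python
import math

lam_s, lam_d = 2e-5, 5e-6  # assumed safe and dangerous failure rates, per hour
total = lam_s + lam_d

for t in (0, 8760, 43800, 87600):        # 0, 1, 5, and 10 years, in hours
    s0 = math.exp(-total * t)            # Equation E-36: R(t)
    s1 = (lam_s / total) * (1 - s0)      # Equation E-40: safe failure probability
    s2 = (lam_d / total) * (1 - s0)      # Equation E-41: dangerous failure probability
    print(f"t={t:6d} h  S0={s0:.4f}  S1={s1:.4f}  S2={s2:.4f}")
```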
multiple failure modes, 94, 151, 153, 269, 272, 287, 290, 384, 450
N matrix, 169, 411-412
nonrepairable component, 124, 152, 441
NORAD, 41
normal distribution, 20, 22-25, 47, 49-50
on-line repair rate, 293
operational profile, 247-248, 250-251
output readback, 192-193
parallel system, 130-131
partial fraction, 446
path space, 230, 234-235, 237, 246
periodic inspection, 257, 265, 273, 275, 277, 383
periodic inspection interval, 256
periodic maintenance, 383
PFD average (PFDavg), 1, 78, 80-82, 109, 114-116, 184, 256, 258, 261, 265-268, 271, 273, 275, 277-278, 291, 301, 312, 315, 317, 361, 367-371, 373-375
physical failures, 38
physical separation, 213
plausibility assertions, 229
PLC input circuit, 95-97, 188
power system, 56, 107
pressure transmitter, 39, 110, 203
priority AND gate, 106
probability density function (pdf), 11-13, 47, 63-64, 72-73, 208, 435
probability of failing safely (PFS), 1, 78, 80-81, 84, 301, 313, 315, 317-321, 323, 325, 328-330, 334-335, 338, 341, 343, 345-346, 349, 351, 353-354, 396
probability of failure on demand (PFD), 1, 78, 80-82, 84, 113-114, 116, 151, 172-175, 184, 201, 256-257, 259-262, 265, 268, 271, 275-278, 290-291, 301, 311-312, 315-318, 320-322, 325-333, 338-340, 343-345, 349-350, 353-354, 361, 367, 383, 395-396
program flow control, 228
programmable electronic systems (PES), 3, 305
programmable logic controllers (PLC), 78-79, 81-83, 95-97, 110-111, 184-185, 188, 193-194, 202, 214-215, 223-224, 306-310, 312-313, 315, 318-319, 322-323, 333, 353-354
Q matrix, 169, 411-412
random failures, 35-37
random variable, 9-16, 18-20, 22, 26, 59-60, 63, 65
reference diagnostics, 84, 190, 228
reliability, 435
reliability network, 133
repair rate, 293
resulting fault, 105
risk, 360-361
    analysis, 365, 384
    cost, 1, 359, 365, 383, 395
    graph, 361-362
risk reduction factor (RRF), 1, 78, 184-186, 256-259, 261-262, 265-268, 271, 273, 275, 277, 361, 365-366, 433
roller coaster curve, 71
safe coverage factor, 84, 187
safety critical software, 228-229
safety functional verification, 382
safety instrumented system (SIS), 4, 78, 80-81, 154, 170, 173, 187, 195, 201-203, 256, 359-361, 363, 366, 368, 372, 383-384, 395-396, 431
safety integrity levels (SIL), 4, 224, 361-362, 364, 367-368, 370
safety PLC, 78, 95, 187, 354
sample mean, 19, 21
sample variance, 21
series system, 73, 307
shock, 43, 46, 53-54, 203, 207
simplification techniques, 300
simulation, 49, 51, 72, 152, 168
single-board controller, 37
single-loop controllers, 306
software
    common-cause strength, 213
    diagnostics, 227
    diversity, 212
    error, 213
    fault avoidance, 226
    maturity model, 227
    strength, 212, 227
    testing, 227, 248-249
Software Engineering Institute, 227
standard deviation, 47
standard normal distribution, 25, 401
starting state, 159, 164, 169
state diagrams, 149
state merging, 300
state space, 229, 237, 423
statistics, 2, 10, 59, 365, 413
steady-state availability, 76, 156, 162
steady-state probabilities, 158
stress rejecter, 234
stress rejection, 226-227, 229
variance, 21-22, 24
vibration, 43
virtual infinity, 227, 230, 246
voting circuit, 92, 193, 330