DESIGN RELIABILITY
Fundamentals and Applications
B. S. Dhillon
CRC Press
Boca Raton London New York Washington, D.C.
Acquiring Editor: Cindy Carelli
Project Editor: Susan Fox
Cover design: Dawn Boyd
This book contains information obtained from authentic and highly regarded sources. Reprinted
material is quoted with permission, and sources are indicated. A wide variety of references are listed.
Reasonable efforts have been made to publish reliable data and information, but the author and the
publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, microfilming, and recording, or by any information
storage or retrieval system, without prior permission in writing from the publisher.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion,
for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press
LLC for such copying.
Direct all inquiries to CRC Press LLC, 2000 Corporate Blvd., N.W., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are only used for identification and explanation, without intent to infringe.
© 1999 by CRC Press LLC
No claim to original U.S. Government works
International Standard Book Number 0-8493-1465-8
Library of Congress Card Number 99-28211
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper
Dedication
This book is affectionately dedicated to my mother
Udham Kaur.
Preface
Today at the dawn of the twenty-first century, we are designing and building new
complex and sophisticated engineering systems for use not only on Earth but to
explore other heavenly bodies. The time is not far off when humankind will permanently reside on other planets and explore stars outside our solar system. Needless
to say, the reliability of systems used for space exploration and for use on
Earth is becoming increasingly important because of factors such as cost, competition, public demand, and the use of untried new technology. The only effective way
to ensure reliability of engineering systems is to consider the reliability factor
seriously during their design.
Over the years, as engineering systems have become more complex and sophisticated, the body of knowledge in reliability engineering has also grown tremendously
and has branched into specialized areas such as human reliability, software reliability,
mechanical reliability, robot reliability, medical device reliability, and reliability and
maintainability management.
Even though a large number of texts are already directly or indirectly
related to reliability engineering, there is still a need for an up-to-date book emphasizing design reliability along with its specialized, application, and related areas.
The main objective of this book is therefore to present new findings and to tailor
the text so that it effectively satisfies the needs of modern design reliability. The
emphasis is on the structure of concepts rather than on mathematical rigor and minute
details. However, the topics are treated in such a manner that the reader needs no
previous knowledge to understand them. Also, the source of most of the material
presented in the book is given in references if the reader wishes to delve deeper into
particular topics. The book contains a large number of examples along with their
solutions, and at the end of each chapter there are numerous problems to test reader
comprehension.
The book is composed of 17 chapters. Chapter 1 presents the historical aspects
of reliability engineering, the need for reliability in engineering design, reliability in
the product design process, and important terms, definitions, and information
sources. Chapter 2 reviews mathematics essential to understanding subsequent chapters. Fundamental aspects of engineering design and reliability management are
presented in Chapter 3. As failure data collection and analysis are considered
the backbone of reliability engineering, Chapter 4 presents many associated aspects.
Chapter 5 presents many basic reliability evaluation and allocation methods; however, the emphasis of the chapter is on the basic reliability evaluation methods.
Chapters 6 and 7 describe in detail the two most widely used methods (i.e., failure
modes and effect analysis and fault tree analysis, respectively) for evaluating engineering designs with respect to reliability. Chapter 8 presents two important topics of
reliability engineering: common-cause failures and three-state devices. Two specialized areas of reliability, mechanical reliability and human reliability, are discussed
in Chapters 9 and 10, respectively.
Chapter 11 presents the topics of reliability testing and growth essential in the
design phase of an engineering system. Chapters 12 through 14 present three application areas of reliability, i.e., reliability in computer systems, robot reliability, and
medical device reliability, respectively. In particular, the emphasis of Chapter 12 is
on computer software reliability. Chapter 15 presents important aspects of design
maintainability and reliability centered maintenance. Chapters 16 and 17 describe
three topics directly or indirectly related to design reliability: total quality management, risk assessment, and life cycle costing.
The book is intended primarily for senior level undergraduate and graduate
students, professional engineers, college and university level teachers, short reliability
course instructors and students, researchers, and engineering system design managers.
The author is deeply indebted to many individuals including colleagues, students,
friends, and reliability and maintainability professionals for their invisible inputs
and encouragement at the moment of need. I thank my children Jasmine and Mark
for their patience and intermittent disturbances leading to desirable coffee and other
breaks. And last, but not least, I thank my boss, friend, and wife, Rosy, for typing
various portions of this book and other related materials, and for her timely help in
proofreading.
B.S. Dhillon
Ottawa, Ontario
The Author
B.S. Dhillon, Ph.D., is a professor of Mechanical Engineering at the University of
Ottawa. He has served as a Chairman/Director of Mechanical Engineering Department/Engineering Management Program for 11 years at the same institution. He has
published over 260 articles on Reliability and Maintainability Engineering and on
related areas. In addition, he has written 20 books on various aspects of system
reliability, safety, human factors, design, and engineering management published by
Wiley (1981), Van Nostrand (1982), Butterworth (1983), Marcel Dekker (1984), and
Pergamon (1986) among others. His books on Reliability have been translated into
several languages including Russian, Chinese, and German. He has served as General
Chairman of two international conferences on reliability and quality control held in
Los Angeles and Paris in 1987. Also, he is or has been on the editorial board of five
international journals.
Dr. Dhillon is a recipient of the American Society of Quality Control Austin J.
Bonis Reliability Award, the Society of Reliability Engineers Merit Award, the Gold
Medal of Honor (American Biographical Institute), and Faculty of Engineering
Glinski Award for excellence in Reliability Engineering Research. He is a registered
Professional Engineer in Ontario and is listed in the American Men and Women of
Science, Men of Achievements, International Dictionary of Biography, Who's Who
in International Intellectuals, and Who's Who in Technology.
At the University of Ottawa, he has been teaching reliability and maintainability
for over 18 years. Dr. Dhillon attended the University of Wales where he received
a B.S. in electrical and electronic engineering and an M.S. in mechanical engineering.
He received a Ph.D. in industrial engineering from the University of Windsor.
Table of Contents
CHAPTER 1: Introduction
1.1 Reliability History
1.2 Need of Reliability in Product Design
1.3 Reliability in the Product Design Process
1.4 Reliability Specialized and Application Areas
1.5 Terms and Definitions
1.6 Reliability Information Sources
1.7 Military and Other Reliability Documents
1.8 Problems
1.9 References
8.3.3 Reliability Optimization of a Series Network
8.3.4 Reliability Optimization of a Parallel Network
Problems
References
12.9 Problems
12.10 References
CHAPTER 13: Robot Reliability and Safety
13.1 Introduction
13.2 Robot Safety
13.2.1 Robot Accidents
13.2.2 Robot Hazards and Safety Problems
13.2.3 Safety Considerations in Robot Life Cycle
13.2.4 Robot Safeguard Approaches
13.2.5 Human Factor Issues in Robotic Safety
13.3 Robot Reliability
13.3.1 Causes and Classifications of Robot Failures
13.3.2 Robot Reliability Measures
13.3.3 Robot Reliability Analysis and Prediction Methods
13.3.4 Reliability and Availability Analyses of a Robot System
Failing with Human Error
13.3.5 Reliability Analysis of a Repairable/Non-Repairable
Robot System
13.4 Problems
13.5 References
CHAPTER 14: Medical Device Reliability
14.1 Introduction
14.2 Facts and Figures, Government Control and Liability
14.3 Medical Electronic Equipment Classification
14.4 Medical Device Recalls
14.5 Medical Device Design Quality Assurance
14.5.1 Organizations
14.5.2 Specifications
14.5.3 Design Review
14.5.4 Reliability Assessment
14.5.5 Parts/Materials Quality Control
14.5.6 Software Quality Control
14.5.7 Design Transfer
14.5.8 Labeling
14.5.9 Certification
14.5.10 Test Instrumentation
14.5.11 Manpower
14.5.12 Quality Monitoring Subsequent to the Design Phase
14.6 Human Error Occurrence and Related Human Factors
14.6.1 Control/Display Related Human Factors Guidelines
14.6.2 Medical Device Maintainability Related Human Factor
Problems
Introduction
A typical Boeing 747 jumbo jet airplane is made up of approximately 4.5 million parts,
including fasteners. Even for relatively simple products, there has been a significant
increase in complexity with respect to parts. For example, in 1935 a farm tractor was
made up of 1200 critical parts, and by 1990 the number had increased to around 2900.
With respect to cost effectiveness, many studies have indicated that the most
effective contribution to profit is the involvement of reliability professionals with
product designers. In fact, according to some experts, if it costs $1 to rectify
a design defect prior to the initial drafting release, the cost increases to $10 after
the final release, $100 at the prototype stage, $1000 at the pre-production stage, and
$10,000 at the production stage. Nonetheless, various studies have revealed that
design-related problems are generally the greatest causes of product failures. For
example, a study performed by the U.S. Navy concerning electronic equipment
failure causes attributed 43% of the failures to design, 30% to operation and maintenance, 20% to manufacturing, and 7% to miscellaneous factors [12].
Well-publicized system failures such as those listed below may have also contributed to more serious consideration of reliability in product design [13-15]:
Space Shuttle Challenger Disaster: This debacle occurred in 1986, in
which all crew members lost their lives. The main reason for this disaster
was design defects.
Chernobyl Nuclear Reactor Explosion: This disaster also occurred in
1986, in the former Soviet Union, in which 31 lives were lost. This debacle
was also the result of design defects.
Point Pleasant Bridge Disaster: This bridge located on the West Virginia/Ohio border collapsed in 1967. The disaster resulted in the loss of
46 lives and its basic cause was the metal fatigue of a critical eye bar.
Mechanical reliability. This is concerned with the reliability of mechanical items. Many textbooks and other publications have appeared on this
topic. A comprehensive list of publications on mechanical reliability is
given in Reference 17.
Software reliability. This is an important emerging area of reliability as
the use of computers is increasing at an alarming rate. Many books have
been written on this topic alone. A comprehensive list of publications on
software reliability may be found in References 10 and 18.
Human reliability. In the past, systems have often failed not due
to technical faults but due to human error. The first book on the topic
appeared in 1986 [19]. A comprehensive list of publications on human
reliability is given in Reference 19.
Reliability optimization. This is concerned with the reliability optimization of engineering systems. So far, at least one book has been written on
the topic and a list of publications on reliability optimization is given in
References 20 and 21.
Reliability growth. This is basically concerned with monitoring reliability
growth of engineering systems during their design and development. A
comprehensive list of publications on the topic is available in Reference 10.
Structural reliability. This is concerned with the reliability of engineering structures, in particular civil engineering. A large number of publications including books have appeared on the subject [11].
Reliability general. This includes developments on reliability of a general
nature. Usually, mathematicians and related professionals contribute to
the area [10].
Power system reliability. This is a well-developed area and is basically
concerned with the application of reliability principles to conventional
power system related problems. Many books on the subject have appeared
over the years including a vast number of other publications [22].
Robot reliability and safety. This is an emerging new area of the application of basic reliability and safety principles to robot associated problems. Over the years many publications on the subject have appeared
including one textbook [23].
Life cycle costing. This is an important subject that is directly related to
reliability. In particular, when estimating the ownership cost of the product,
the knowledge regarding its failure rate is essential. In the past, many publications on life cycle costing have appeared including several books [24].
Maintainability. This is closely coupled to reliability and is concerned
with the maintaining aspect of the product. Over the years, a vast number
of publications on the subject have appeared including some books [10].
Reliability. This is the probability that an item will carry out its assigned
mission satisfactorily for the stated time period when used under the
specified conditions.
Failure. This is the inability of an item to function within the initially
defined guidelines.
Downtime. This is the time period during which the item is not in a
condition to carry out its stated mission.
Maintainability. This is the probability that a failed item will be repaired
to its satisfactory working state.
Redundancy. This is the existence of more than one means for accomplishing a defined function.
Active redundancy. This is a type of redundancy when all redundant
items are operating simultaneously.
Availability. This is the probability that an item is available for application
or use when needed.
Mean time to failure (exponential distribution). This is the sum of the
operating times of given items divided by the total number of failures (see the short example following these definitions).
Useful life. This is the length of time an item operates within an acceptable
level of failure rate.
Mission time. This is the time during which the item is performing its
specified mission.
Human error. This is the failure to perform a given task (or the performance of a forbidden action) that could lead to disruption of scheduled
operations or result in damage to property/equipment.
Human reliability. This is the probability of completing a job/task successfully by humans at any required stage in the system operation within
a defined minimum time limit (if the time requirement is specified).
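The mean time to failure definition above is essentially an arithmetic recipe. The short Python sketch below, which uses hypothetical operating times and a hypothetical failure count rather than data from this book, illustrates the calculation and the corresponding constant failure rate.

```python
# Hypothetical example: cumulative operating times (hours) logged for three
# identical items, with the total number of failures observed over that period.
operating_times = [1200.0, 950.0, 1430.0]   # assumed data
failures = 4                                 # assumed number of failures

mttf = sum(operating_times) / failures       # mean time to failure, hours
failure_rate = 1.0 / mttf                    # constant (exponential) failure rate

print(round(mttf, 1), round(failure_rate, 6))  # 895.0, 0.001117
```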
Reliability Engineering and System Safety, published by Elsevier Science Publishers several times a year.
Reliability Review, published by the Reliability Division of ASQC four
times a year.
2. Conference Proceedings
Proceedings of the Annual Reliability and Maintainability Symposium
(U.S.)
Proceedings of the Annual Reliability Engineering Conference for the
Electric Power Industry (U.S.)
Proceedings of the European Conference on Safety and Reliability
(Europe)
Proceedings of the Symposium on Reliability in Electronics (Hungary)
Proceedings of the International Conference on Reliability, Maintainability, and Safety (China)
Proceedings of the International Conference on Reliability and Exploitation of Computer Systems (Poland)
3. Agencies
Reliability Analysis Center
Rome Air Development Center (RADC)
Griffiss Air Force Base
New York, NY 13441-5700
Government Industry Data Exchange Program (GIDEP)
GIDEP Operations Center
U.S. Department of Navy
Naval Weapons Station
Seal Beach
Corona, CA 91720
National Aeronautics and Space Administration (NASA) Parts Reliability
Information Center
George C. Marshall Space Flight Center
Huntsville, AL 35812
National Technical Information Service (NTIS)
5285 Port Royal Road
Springfield, VA 22151
Defense Technical Information Center
DTIC-FDAC
8725 John J. Kingman Road, Suite 0944
Fort Belvoir, VA 22060-6218
American National Standards Institute (ANSI)
11 W. 42nd St.
New York, NY 10036
Technical Services Department
American Society for Quality Control
611 W. Wisconsin Ave., P.O. Box 3005
Milwaukee, WI 53201-3005
1.8 PROBLEMS
1. Write an essay on the history of reliability.
2. Discuss the need for reliability in product design.
3. Discuss at least three well-publicized system failures and the reasons for
their occurrence.
4. What are the reliability related tasks performed during the product design
process?
5. Describe the following two specialized areas of reliability:
Human reliability
Software reliability
6. What are the differences between reliability and maintainability?
7. Define the following terms:
Failure rate
Redundancy
Failure
8. Describe the following two agencies for obtaining reliability related information:
Reliability Analysis Center
GIDEP
9. In your opinion, what are the five U.S. military reliability related documents that are the most important during the design process? Give reasons
for your selection.
10. Make comparisons between mechanical reliability and power system reliability.
1.9 REFERENCES
1. Lyman, W.J., Fundamental consideration in preparing a master system plan, Electrical
World, 101, 778-792, 1933.
2. Smith, S.A., Service reliability measured by probabilities of outage, Electrical World,
103, 371-374, 1934.
3. Dhillon, B.S., Power System Reliability, Safety and Management, Ann Arbor Science
Publishers, Ann Arbor, MI, 1983.
4. Coppola, A., Reliability engineering of electronic equipment: a historical perspective,
IEEE Trans. Reliability, 33, 29-35, 1984.
5. Weibull, W., A statistical distribution function of wide applicability, J. Appl. Mech.,
18, 293-297, 1951.
6. Davis, D.J., An analysis of some failure data, J. Am. Statist. Assoc., 47, 113-150, 1952.
7. Henney, K., Ed., Reliability Factors for Ground Electronic Equipment, McGraw-Hill,
New York, 1956.
8. AGREE Report, Advisory Group on Reliability of Electronic Equipment (AGREE),
Reliability of Military Electronic Equipment, Office of the Assistant Secretary of
Defense (Research and Engineering), Department of Defense, Washington, D.C., 1957.
9. MIL-R-25717 (USAF), Reliability Assurance Program for Electronic Equipment,
Department of Defense, Washington, D.C.
10. Dhillon, B.S., Reliability and Quality Control: Bibliography on General and Specialized Areas, Beta Publishers, Gloucester, Ontario, 1992.
11. Dhillon, B.S., Reliability Engineering Application: Bibliography on Important Application Areas, Beta Publishers, Gloucester, Ontario, 1992.
12. Niebel, B.W., Engineering Maintenance Management, Marcel Dekker, New York,
1994.
13. Dhillon, B.S., Engineering Design: A Modern Approach, Richard D. Irwin, Chicago,
IL, 1996.
14. Elsayed, E.A., Reliability Engineering, Addison Wesley Longman, Reading, MA,
1996.
15. Dhillon, B.S., Advanced Design Concepts for Engineers, Technomic Publishing Company, Lancaster, PA, 1998.
16. Reliability, Maintainability, and Supportability Guidebook, SAE G-11 RMS Committee Report, Society of Automotive Engineers (SAE), Warrendale, PA, 1990.
17. Dhillon, B.S., Mechanical Reliability: Theory, Models and Applications, American
Institute of Aeronautics and Astronautics, Washington, D.C., 1988.
18. Dhillon, B.S., Reliability in Computer System Design, Ablex Publishing, Norwood,
NJ, 1987.
19. Dhillon, B.S., Human Reliability with Human Factors, Pergamon Press, New York,
1986.
20. Tillman, F.A., Hwang, C.L., and Kuo, W., Optimization of Systems Reliability, Marcel
Dekker, New York, 1980.
21. Tillman, F.A., Hwang, C.L., and Kuo, W., Optimization techniques for system reliability with redundancy: A review, IEEE Trans. Reliability, 26, 148-155, 1977.
22. Dhillon, B.S., Power System Reliability, Safety and Management, Ann Arbor Science,
Ann Arbor, MI, 1983.
23. Dhillon, B.S., Robot Reliability and Safety, Springer-Verlag, New York, 1991.
24. Dhillon, B.S., Life Cycle Costing: Techniques, Models, and Applications, Gordon and
Breach Science Publishers, New York, 1989.
Design Reliability Mathematics
2.1 INTRODUCTION
The history of our current number symbols, often referred to as the Hindu-Arabic
numeral system, may be traced back over 2000 years to the stone columns erected
by the Scythian Indian emperor Asoka. Some of these columns erected around
250 B.C. contain the present number symbols [1]. It appears the Hindus invented
the numeral system and the Arabs transmitted it to western Europe after their invasion
of the Spanish peninsula in 711 A.D. For example, the numeral symbols are found
in a tenth-century Spanish manuscript [1]. However, zero, from the Hindu "sunya" meaning
"empty" or "void" and the Arabic "sifr", was only introduced into Germany in the
thirteenth century. Nonetheless, around 2000 B.C. Babylonians solved quadratic
equations successfully.
Even though traces of differentiation may be found as far back as the ancient
Greeks, the idea of modern differentiation comes from Pierre Fermat (1601–1665).
Laplace transforms, often used to find solutions to differential equations, were one
of the works of Pierre-Simon Laplace (1749–1827).
The history of probability may be traced back to the writing of a gambler's
manual by Girolamo Cardano (1501–1576), in which he considered some interesting
issues on probability [1, 2]. However, Pierre Fermat and Blaise Pascal (1623–1662)
solved independently and correctly the problem of dividing the winnings in a game
of chance. The first formal treatise on probability based on the Pascal–Fermat
correspondence was written by Christiaan Huygens (1629–1695) in 1657. Boolean
algebra, which plays a significant role in modern probability theory, is named after
the mathematician George Boole (1815–1864), who published a pamphlet entitled
The Mathematical Analysis of Logic, Being an Essay towards a Calculus of Deductive Reasoning in 1847 [1, 3].
This chapter presents various mathematical concepts related to design reliability.
COMMUTATIVE LAW:
X · Y = Y · X
(2.1)
where
X is an arbitrary set or event.
Y is an arbitrary set or event.
The dot (·) denotes the intersection of sets. Sometimes Equation (2.1) is written without the dot, but it still conveys the same meaning.
X+Y=Y+X
(2.2)
where
+ denotes the union of sets.
ASSOCIATIVE LAW:
(XY) Z = X (YZ)
(2.3)
where
Z is an arbitrary set or event.
( X + Y) + Z = X + (Y + Z )
(2.4)
DISTRIBUTIVE LAW:
X (Y + Z) = X Y + X Z
(2.5)
X + Y Z = (X + Y) (X + Z)
(2.6)
ABSORPTION LAW:
X + (X Y) = X
(2.7)
X (X + Y) = X
(2.8)
IDEMPOTENT LAW:
X X = X
(2.9)
X + X = X
(2.10)
The probability of occurrence of an arbitrary event, say X, lies between zero and one:
0 \leq P(X) \leq 1
(2.11)
The probability of occurrence of the sample space S is
P(S) = 1
(2.12)
The probability of occurrence of the negation of the sample space S is
P(\bar{S}) = 0
(2.13)
where
\bar{S} is the negation of the sample space S.
The probability of the union of n independent events is
P(A_1 + A_2 + \cdots + A_n) = 1 - \prod_{i=1}^{n} [1 - P(A_i)]
(2.14)
where
A_i is the ith event; for i = 1, 2, ..., n.
P(A_i) is the probability of occurrence of event A_i; for i = 1, 2, ..., n.
For n = 2, Equation (2.14) reduces to
P(A_1 + A_2) = P(A_1) + P(A_2) - P(A_1) P(A_2)
(2.15)
The probability of occurrence of the intersection of n independent events is
P(A_1 A_2 \cdots A_n) = \prod_{i=1}^{n} P(A_i)
(2.16)
(2.17)
The probability of occurrence and nonoccurrence of an event A is
P(A) + P(\bar{A}) = 1
(2.18)
where
P(A) is the probability of occurrence of A.
P(\bar{A}) is the probability of nonoccurrence of A.
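Equations (2.14) through (2.18) are straightforward to evaluate numerically. The following minimal Python sketch, with assumed event probabilities rather than values from the text, illustrates the union and intersection formulas for independent events.

```python
# An illustrative calculation (assumed probabilities) of Equations (2.14)
# and (2.16) for three independent events.
p = [0.05, 0.10, 0.02]   # hypothetical event probabilities P(A1), P(A2), P(A3)

prod_not = 1.0
prod_all = 1.0
for pi in p:
    prod_not *= (1.0 - pi)   # product of nonoccurrence probabilities
    prod_all *= pi           # product of occurrence probabilities

p_union = 1.0 - prod_not     # Eq. (2.14): P(A1 + A2 + A3)
p_intersection = prod_all    # Eq. (2.16): P(A1 A2 A3)

print(round(p_union, 4), p_intersection)   # 0.1621, 0.0001
```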
2.4.1 DEFINITION OF PROBABILITY
The probability of occurrence of an event, say X, may be defined as
P(X) = \lim_{n \to \infty} \frac{N}{n}
(2.19)
where
P(X) is the probability of occurrence of event X.
N is the number of times that X occurs in the n repeated experiments.
2.4.2 CUMULATIVE DISTRIBUTION FUNCTION
The cumulative distribution function is expressed by
F(t) = \int_{-\infty}^{t} f(x) \, dx
(2.20)
where
t is time.
F(t) is the cumulative distribution function.
f(t) is the probability density function.
For t = \infty, Equation (2.20) yields
F(\infty) = \int_{-\infty}^{\infty} f(x) \, dx = 1
(2.21)
It means the total area under the probability density curve is equal to unity.
2.4.3 PROBABILITY DENSITY FUNCTION
The probability density function is obtained by differentiating the cumulative distribution function:
\frac{d F(t)}{d t} = \frac{d}{d t} \int_{-\infty}^{t} f(x) \, dx = f(t)
(2.22)
2.4.4 RELIABILITY FUNCTION
The reliability function is defined by
R(t) = 1 - F(t) = 1 - \int_{-\infty}^{t} f(x) \, dx
(2.23)
where
R(t) is the reliability function.
2.4.5 HAZARD RATE FUNCTION
The hazard rate function is defined by
\lambda(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{1 - F(t)}
(2.24)
2.4.6 LAPLACE TRANSFORM
The Laplace transform of a function f(t) is defined by
f(s) = \int_{0}^{\infty} f(t) \, e^{-s t} \, dt
(2.25)
where
s is the Laplace transform variable.
t is the time variable.
f(s) is the Laplace transform of f(t).
Example 2.1
If f(t) = e^{-\lambda t}, obtain the Laplace transform of this function.
Substituting the above function into Equation (2.25), we get
f(s) = \int_{0}^{\infty} e^{-\lambda t} e^{-s t} \, dt = \int_{0}^{\infty} e^{-(s + \lambda) t} \, dt
= \left[ \frac{-e^{-(s + \lambda) t}}{s + \lambda} \right]_{0}^{\infty} = \frac{1}{s + \lambda}
(2.26)
Example 2.2
Obtain the Laplace transform of the function f(t) = 1.
By inserting the above function into Equation (2.25), we get
f(s) = \int_{0}^{\infty} 1 \cdot e^{-s t} \, dt = \left[ \frac{-e^{-s t}}{s} \right]_{0}^{\infty} = \frac{1}{s}
(2.27)
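Examples 2.1 and 2.2 can also be checked symbolically. The short sketch below uses SymPy's laplace_transform; the variable names are illustrative and not part of the original text.

```python
# A minimal sketch (not from the book) verifying Examples 2.1 and 2.2
# with SymPy's symbolic Laplace transform.
import sympy as sp

t, s = sp.symbols('t s', positive=True)
lam = sp.symbols('lambda', positive=True)

# Example 2.1: f(t) = exp(-lambda*t)  ->  1/(s + lambda)
F1 = sp.laplace_transform(sp.exp(-lam * t), t, s, noconds=True)
print(F1)  # 1/(lambda + s)

# Example 2.2: f(t) = 1  ->  1/s
F2 = sp.laplace_transform(sp.Integer(1), t, s, noconds=True)
print(F2)  # 1/s
```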
Table 2.1 presents Laplace transforms of various functions considered useful for
performing mathematical design reliability analysis [9].
TABLE 2.1
Laplace Transforms of Some Commonly Occurring Functions in Reliability Work

f(t)                                     f(s)
e^{-\lambda t}                           1/(s + \lambda)
c, a constant                            c/s
d f(t)/d t                               s f(s) - f(0)
\int_0^t f(t) \, dt                      f(s)/s
t f(t)                                   -d f(s)/d s
\lambda_1 f_1(t) + \lambda_2 f_2(t)      \lambda_1 f_1(s) + \lambda_2 f_2(s)
t^n, for n = 0, 1, 2, 3, ...             n!/s^{n+1}
t e^{-\lambda t}                         1/(s + \lambda)^2
2.4.7 FINAL-VALUE THEOREM
If the following limits exist, then the final-value theorem may be expressed as
\lim_{t \to \infty} f(t) = \lim_{s \to 0} [s \, f(s)]
(2.28)
2.4.8 EXPECTED VALUE
The expected value, E(t), of a continuous random variable t is defined by
E(t) = \mu = \int_{-\infty}^{\infty} t \, f(t) \, dt
(2.29)
where
t is a continuous random variable.
f(t) is the probability density function.
\mu is the mean value.
Similarly, the expected value, E(t), of a discrete random variable t is defined by
E(t) = \sum_{i=1}^{m} t_i \, f(t_i)
(2.30)
where
m is the number of discrete values of the random variable t.
2.5.1 BINOMIAL DISTRIBUTION
The binomial probability density function is defined by
f(y) = \binom{k}{y} p^{y} q^{k-y}, for y = 0, 1, 2, ..., k
(2.31)
where
\binom{k}{i} = \frac{k!}{i! \, (k-i)!}
p is the single-trial probability of occurrence (e.g., of failure).
q = 1 - p is the single-trial probability of nonoccurrence.
The cumulative distribution function is given by
F(y) = \sum_{i=0}^{y} \binom{k}{i} p^{i} q^{k-i}
(2.32)
where
F(y) is the cumulative distribution function or the probability of y or fewer failures in k trials.
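Equation (2.32) is easy to evaluate directly. The short Python sketch below, with hypothetical values of k, p, and y, computes the probability of y or fewer failures in k trials.

```python
# Hypothetical numbers for illustration: probability of at most 2 failures
# in k = 10 independent trials, each with failure probability p = 0.05,
# using Equations (2.31) and (2.32).
from math import comb

def binomial_cdf(y, k, p):
    """F(y): probability of y or fewer failures in k trials (Eq. 2.32)."""
    q = 1.0 - p
    return sum(comb(k, i) * p**i * q**(k - i) for i in range(y + 1))

print(round(binomial_cdf(2, 10, 0.05), 4))  # ~0.9885
```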
2.5.2 POISSON DISTRIBUTION
This is another distribution that finds applications in reliability analysis when one
is interested in the occurrence of a number of events that are of the same type. Each
event's occurrence is denoted as a point on a time scale, and each event represents
a failure. The Poisson density function is expressed by
f(m) = \frac{(\lambda t)^{m} e^{-\lambda t}}{m!}, for m = 0, 1, 2, ...
(2.33)
where
\lambda is the constant failure or arrival rate.
t is the time.
The cumulative distribution function, F, is
F = \sum_{i=0}^{m} \frac{(\lambda t)^{i} e^{-\lambda t}}{i!}
(2.34)
2.5.3 EXPONENTIAL DISTRIBUTION
This is probably the most widely used distribution in reliability engineering because
many engineering items exhibit a constant hazard rate during their useful life [11].
Also, it is relatively easy to handle in performing reliability analysis.
The probability density function of the distribution is defined by
f(t) = \lambda e^{-\lambda t}, t \geq 0, \lambda > 0
(2.35)
where
f(t) is the probability density function.
t is time.
\lambda is the distribution parameter. In reliability studies, it is known as the constant failure rate.
Using Equations (2.20) and (2.35), we get the following expression for the cumulative distribution function:
F(t) = 1 - e^{-\lambda t}
(2.36)
With the aid of Equations (2.29) and (2.35), the following expression for the expected value of t is obtained:
E(t) = \frac{1}{\lambda}
(2.37)
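As a numerical illustration of Equations (2.36) and (2.37), the brief Python sketch below uses an assumed constant failure rate and mission time; the values are hypothetical and not taken from the text.

```python
# A small sketch with an assumed constant failure rate of 0.0004 failures/hour.
from math import exp

lam = 0.0004          # hypothetical constant failure rate, failures per hour
t = 1000.0            # time of interest, hours

F = 1.0 - exp(-lam * t)   # probability of failure by time t, Eq. (2.36)
mttf = 1.0 / lam          # expected value (mean time to failure), Eq. (2.37)

print(round(F, 4), mttf)  # 0.3297, 2500.0
```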
2.5.4 RAYLEIGH DISTRIBUTION
This distribution is often used in reliability work and in the theory of sound, and is
named after John Rayleigh (1842–1919), its originator [1]. The probability density
function of the distribution is defined as
f(t) = \frac{2}{\theta^{2}} \, t \, e^{-(t/\theta)^{2}}, t \geq 0, \theta > 0
(2.38)
where
\theta is the distribution parameter.
Inserting Equation (2.38) into Equation (2.20), we get the following equation for the
cumulative distribution function:
F(t) = 1 - e^{-(t/\theta)^{2}}
(2.39)
Using Equations (2.29) and (2.38), we get the following expression for the
expected value of t:
E(t) = \theta \, \Gamma(3/2)
(2.40)
where
\Gamma(\cdot) is the gamma function and is defined by
\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t} \, dt, for x > 0
(2.41)
2.5.5 WEIBULL DISTRIBUTION
This distribution can be used to represent many different physical phenomena, and
it was developed by Weibull in the early 1950s [12]. The probability density function
for the distribution is defined by
f(t) = \frac{b \, t^{b-1}}{\theta^{b}} \, e^{-(t/\theta)^{b}}, t \geq 0, \theta > 0, b > 0
(2.42)
where
b and \theta are the shape and scale parameters, respectively.
Using Equations (2.20) and (2.42), we get the following cumulative distribution
function:
F(t) = 1 - e^{-(t/\theta)^{b}}
(2.43)
Substituting Equation (2.42) into Equation (2.29), we get the following equation for
the expected value of t:
E(t) = \theta \, \Gamma\left(1 + \frac{1}{b}\right)
(2.44)
For b = 1 and b = 2, the exponential and Rayleigh distributions are the special cases of
this distribution, respectively.
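The Weibull expressions in Equations (2.43) and (2.44) can be evaluated with nothing more than the standard math module. The sketch below uses assumed scale and shape parameter values, not figures from the book.

```python
# A small sketch (assumed values) for the Weibull model of Equations (2.42)-(2.44):
# reliability R(t) = exp[-(t/theta)**b] and mean life theta*Gamma(1 + 1/b).
from math import exp, gamma

theta, b = 1000.0, 1.5   # hypothetical scale (hours) and shape parameters
t = 500.0                # mission time, hours

R = exp(-(t / theta) ** b)            # reliability at time t, from Eq. (2.43)
mean_life = theta * gamma(1 + 1 / b)  # expected value of t, Eq. (2.44)

print(round(R, 4), round(mean_life, 1))  # ~0.7022, ~902.7
```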
2.5.6 GENERAL DISTRIBUTION
This distribution can be used to represent the entire bathtub shape hazard rate
curve [13]. Its probability density function is defined by:
f(t) = \left[ k \lambda u t^{u-1} + (1-k) b t^{b-1} \lambda e^{\lambda t^{b}} \right] \exp\left[ -k \lambda t^{u} - (1-k) \left( e^{\lambda t^{b}} - 1 \right) \right]
(2.45)
The corresponding cumulative distribution function is
F(t) = 1 - \exp\left[ -k \lambda t^{u} - (1-k) \left( e^{\lambda t^{b}} - 1 \right) \right]
(2.46)
where
u and b are the shape parameters, \lambda is the scale parameter, and 0 \leq k \leq 1.
2.5.7 NORMAL DISTRIBUTION
This is one of the most widely known distributions and sometimes it is called the
Gaussian distribution, after Carl Friedrich Gauss (1777–1855), the German mathematician. Nonetheless, its use in reliability studies is rather limited. The probability
density function of the distribution is defined by:
f(t) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{(t-\mu)^{2}}{2\sigma^{2}} \right], -\infty < t < +\infty
(2.47)
where
\mu and \sigma are the distribution parameters (i.e., mean and standard deviation,
respectively).
Using Equations (2.20) and (2.47), we get the following cumulative distribution
function:
F(t) = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{t} \exp\left[ -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right] dx
(2.48)
(2.49)
2.6.1 EXPONENTIAL DISTRIBUTION
By substituting Equations (2.35) and (2.36) into Equation (2.24), we get the following hazard rate function for the exponential distribution:
\lambda(t) = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda
(2.50)
It means the hazard rate of an exponential distribution is constant and it is simply
called the failure rate.
2.6.2 RAYLEIGH DISTRIBUTION
Using Equations (2.24), (2.38), and (2.39), the following hazard rate expression is
obtained for the Rayleigh distribution:
\lambda(t) = \frac{(2/\theta^{2}) \, t \, e^{-(t/\theta)^{2}}}{e^{-(t/\theta)^{2}}} = \frac{2}{\theta^{2}} \, t
(2.51)
Equation (2.51) indicates that the hazard rate of the Rayleigh distribution increases
linearly with time.
2.6.3 WEIBULL DISTRIBUTION
Using Equations (2.24), (2.42), and (2.43), we get the following hazard rate expression for the Weibull distribution:
\lambda(t) = \frac{b \, t^{b-1}}{\theta^{b}}
(2.52)
Equation (2.52) indicates that the Weibull hazard rate is a function of time t and at
b = 1 and b = 2, it reduces to the hazard rates for the exponential and Rayleigh distributions, respectively.
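A quick way to see the behavior described by Equation (2.52) is to evaluate the Weibull hazard rate for several shape parameter values. The following sketch uses assumed parameters (not from the text) to show decreasing (b < 1), constant (b = 1), and increasing (b > 1) hazard rates.

```python
# A brief sketch (assumed parameter values) of the Weibull hazard rate of
# Equation (2.52), lambda(t) = b*t**(b-1)/theta**b.
def weibull_hazard(t, theta, b):
    return b * t ** (b - 1) / theta ** b

theta = 1000.0  # hypothetical scale parameter, hours
for b in (0.5, 1.0, 2.0):
    rates = [round(weibull_hazard(t, theta, b), 6) for t in (100.0, 500.0, 900.0)]
    print(b, rates)   # decreasing, constant, and increasing with time
```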
2.6.4 GENERAL DISTRIBUTION
With the aid of Equations (2.24), (2.45), and (2.46), we get the following expression
for the general distribution hazard rate:
\lambda(t) = \frac{\left[ k \lambda u t^{u-1} + (1-k) b t^{b-1} \lambda e^{\lambda t^{b}} \right] \exp\left[ -k \lambda t^{u} - (1-k) \left( e^{\lambda t^{b}} - 1 \right) \right]}{\exp\left[ -k \lambda t^{u} - (1-k) \left( e^{\lambda t^{b}} - 1 \right) \right]}
= k \lambda u t^{u-1} + (1-k) b t^{b-1} \lambda e^{\lambda t^{b}}
(2.53)
Obviously, in Equation (2.53) the hazard rate, \lambda(t), varies with time, and its special
case hazard rates are those of the Weibull, bathtub, extreme value, Makeham, Rayleigh, and
exponential distributions.
2.6.5 NORMAL DISTRIBUTION
Using Equations (2.24), (2.47), and (2.48), we get the following expression for the normal distribution hazard rate:
\lambda(t) = \frac{f(t)}{R(t)} = \exp\left[ -\frac{(t-\mu)^{2}}{2\sigma^{2}} \right] \bigg/ \int_{t}^{\infty} \exp\left[ -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right] dx
(2.54)
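Equation (2.54) is usually evaluated numerically. The sketch below, with assumed mean and standard deviation, uses scipy.stats.norm (its density and survival functions) to compute the normal hazard rate at a few times; the numbers are illustrative only.

```python
# A numeric illustration (assumed mu, sigma) of the normal hazard rate in
# Equation (2.54), lambda(t) = f(t)/R(t), using scipy.stats.norm.
from scipy.stats import norm

mu, sigma = 5000.0, 1000.0   # hypothetical mean and standard deviation, hours
for t in (4000.0, 5000.0, 6000.0):
    hazard = norm.pdf(t, mu, sigma) / norm.sf(t, mu, sigma)
    print(t, round(hazard, 6))   # hazard rate increases with time
```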
of M(s) is less than that of Q(s). Consequently, the ratio of M(s)/Q(s) may be
expressed as the sum of rational functions or partial fractions as follows:
\frac{A}{(a s + b)^{m}}, for m = 1, 2, 3, ...
(2.55)
\frac{B s + C}{(a s^{2} + b s + c)^{m}}, for m = 1, 2, 3, ...
(2.56)
where
A, B, C, a, b, and c are the constants.
s is the Laplace transform variable.
Heaviside Theorem
This theorem is due to Oliver Heaviside (1850–1925) [1] and is used to obtain partial
fractions and the inverse of a rational function, M(s)/Q(s). Thus, the inverse of
M(s)/Q(s) is expressed as follows:
L^{-1}\left[ \frac{M(s)}{Q(s)} \right] = \sum_{i=1}^{m} \frac{M(\alpha_i)}{Q'(\alpha_i)} \, e^{\alpha_i t}
(2.57)
where
the prime on Q denotes differentiation with respect to s.
m denotes the total number of distinct zeros of Q(s).
\alpha_i denotes zero i; for i = 1, 2, 3, ..., m.
t is time.
Example 2.3
Suppose
\frac{M(s)}{Q(s)} = \frac{s}{(s - 5)(s - 7)}
Find its inverse Laplace transform by applying the Heaviside theorem.
Hence,
M(s) = s, Q(s) = (s - 5)(s - 7) = s^{2} - 12 s + 35, Q'(s) = 2 s - 12, m = 2, \alpha_1 = 5, and \alpha_2 = 7.
(2.58)
Using the above values in the right-hand side of Equation (2.57) yields:
L^{-1}\left[ \frac{M(s)}{Q(s)} \right] = \frac{M(\alpha_1)}{Q'(\alpha_1)} e^{\alpha_1 t} + \frac{M(\alpha_2)}{Q'(\alpha_2)} e^{\alpha_2 t}
= \frac{M(5)}{Q'(5)} e^{5 t} + \frac{M(7)}{Q'(7)} e^{7 t}
= -\frac{5}{2} e^{5 t} + \frac{7}{2} e^{7 t}
= \frac{7}{2} e^{7 t} - \frac{5}{2} e^{5 t}
(2.59)
Thus, the inverse Laplace transform of Equation (2.58) is given by Equation (2.59).
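Example 2.3 can be verified with a symbolic inverse Laplace transform. The following minimal SymPy sketch is illustrative and not part of the original text.

```python
# A short check (not from the book) of Example 2.3 using SymPy's
# inverse Laplace transform.
import sympy as sp

s, t = sp.symbols('s t', positive=True)
expr = s / ((s - 5) * (s - 7))
print(sp.inverse_laplace_transform(expr, s, t))
# -> 7*exp(7*t)/2 - 5*exp(5*t)/2  (times Heaviside(t))
```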
2.7.2 EQUATION ROOTS
A quadratic equation may be written in the form
a y^{2} + b y + c = 0
(2.60)
where
a, b, and c are the constants.
Solutions to Equation (2.60) are as follows:
y_1, y_2 = \frac{-b \pm W^{1/2}}{2 a}
(2.61)
where
W \equiv b^{2} - 4 a c
(2.62)
The two roots also satisfy
y_1 + y_2 = -\frac{b}{a}
(2.63)
y_1 y_2 = \frac{c}{a}
(2.64)
Similarly, a cubic equation may be written in the form
y^{3} + a_2 y^{2} + a_1 y + a_0 = 0
(2.65)
Let
M = \frac{1}{3} a_1 - \frac{1}{9} a_2^{2}
(2.66)
N = \frac{1}{6} (a_1 a_2 - 3 a_0) - \frac{1}{27} a_2^{3}
(2.67)
If y_1, y_2, and y_3 are the roots of Equation (2.65), then we have:
y_1 y_2 y_3 = -a_0
(2.68)
y_1 + y_2 + y_3 = -a_2
(2.69)
y_1 y_2 + y_1 y_3 + y_2 y_3 = a_1
(2.70)
Furthermore, let
g_1 = \left[ N + \left( M^{3} + N^{2} \right)^{1/2} \right]^{1/3}
(2.71)
and
g_2 = \left[ N - \left( M^{3} + N^{2} \right)^{1/2} \right]^{1/3}
(2.72)
Then the roots of Equation (2.65) are
y_1 = (g_1 + g_2) - \frac{a_2}{3}
(2.73)
y_2 = -\frac{1}{2} (g_1 + g_2) - \frac{a_2}{3} + i \frac{\sqrt{3}}{2} (g_1 - g_2)
(2.74)
y_3 = -\frac{1}{2} (g_1 + g_2) - \frac{a_2}{3} - i \frac{\sqrt{3}}{2} (g_1 - g_2)
(2.75)
Assume that an engineering system can be in either of two states, operating or failed, and that its failure process is described by the following differential equations:
\frac{d P_0(t)}{d t} + \lambda P_0(t) = 0
(2.76)
\frac{d P_1(t)}{d t} - \lambda P_0(t) = 0
(2.77)
where
P_i(t) is the probability that the engineering system is in state i at time t; for
i = 0 (operating), i = 1 (failed).
\lambda is the system failure rate.
At time t = 0, P_0(0) = 1, and P_1(0) = 0.
Find solutions to Equations (2.76) and (2.77).
In this case, we use the Laplace transform method. Thus, taking Laplace transforms of Equations (2.76) and (2.77) and using the given initial conditions, we get
P_0(s) = \frac{1}{s + \lambda}
(2.78)
P_1(s) = \frac{\lambda}{s (s + \lambda)}
(2.79)
The inverse Laplace transforms of Equations (2.78) and (2.79) are as follows:
P_0(t) = e^{-\lambda t}
(2.80)
P_1(t) = 1 - e^{-\lambda t}
(2.81)
Thus, Equations (2.80) and (2.81) represent solutions to differential Equations (2.76)
and (2.77).
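The solutions in Equations (2.80) and (2.81) can be confirmed symbolically. The sketch below, which is illustrative and not from the book, solves Equations (2.76) and (2.77) one at a time with SymPy's dsolve, applying the stated initial conditions.

```python
# A minimal sketch solving Equations (2.76) and (2.77) with SymPy,
# to confirm Equations (2.80) and (2.81).
import sympy as sp

t = sp.symbols('t', nonnegative=True)
lam = sp.symbols('lambda', positive=True)
P0, P1 = sp.Function('P0'), sp.Function('P1')

# Equation (2.76) with P0(0) = 1
sol0 = sp.dsolve(sp.Eq(P0(t).diff(t) + lam * P0(t), 0), P0(t), ics={P0(0): 1})
# Equation (2.77) with P1(0) = 0, after substituting the solution for P0(t)
sol1 = sp.dsolve(sp.Eq(P1(t).diff(t) - lam * sol0.rhs, 0), P1(t), ics={P1(0): 0})

print(sol0)  # Eq(P0(t), exp(-lambda*t))
print(sol1)  # Eq(P1(t), 1 - exp(-lambda*t))
```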
2.9 PROBLEMS
1. Write an essay on the history of mathematics including probability theory.
2. Prove the following Boolean algebra expression:
X + Y Z = (X + Y) (X + Z)
(2.82)
3. Prove that
R(t) + F(t) = 1
(2.83)
where
R(t) is the item reliability at time t.
F(t) is the item failure probability at time t.
4. Obtain the Laplace transform of the following function:
f(t) = \frac{t^{k-1} e^{-a t}}{(k-1)!}, for k = 1, 2, 3, ...
(2.84)
(s + 4) (s + 3)
(2.85)
(2.86)
where
\theta and b are the scale and shape parameters, respectively.
9. What are the special case distributions of the Weibull distribution?
10. What are the special case distributions of the general distribution?
2.10 REFERENCES
1. Eves, H., An Introduction to the History of Mathematics, Holt, Rinehart, and Winston,
New York, 1976.
2. Owen, D.B., Ed., On the History of Statistics and Probability, Marcel Dekker, New
York, 1976.
3. Lipschutz, S., Set Theory, McGraw-Hill, New York, 1964.
4. Fault Tree Handbook, Report No. NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, D.C., 1981.
5. Ramakumar, R., Engineering Reliability: Fundamentals and Applications, Prentice-Hall, Englewood Cliffs, NJ, 1993.
6. Mann, N.R., Schafer, R.E., and Singpurwalla, N.D., Methods for Statistical Analysis
of Reliability and Life Data, John Wiley & Sons, New York, 1974.
7. Lipschutz, S., Probability, McGraw-Hill, New York, 1965.
8. Shooman, M.L., Probabilistic Reliability: An Engineering Approach, McGraw-Hill,
New York, 1968.
9. Oberhettinger, F. and Badic, L., Tables of Laplace Transforms, Springer-Verlag, New
York, 1973.
10. Patel, J.K., Kapadia, C.H., and Owen, D.B., Handbook of Statistical Distributions,
Marcel Dekker, New York, 1976.
11. Davis, D.J., An analysis of some failure data, J. Am. Stat. Assoc., 47, 113-150, 1952.
12. Weibull, W., A statistical distribution function of wide applicability, J. Appl. Mech.,
18, 293-297, 1951.
13. Dhillon, B.S., A hazard rate model, IEEE Trans. Reliability, 29, 150, 1979.
14. Abramowitz, M. and Stegun, I.A., Eds., Handbook of Mathematical Functions with
Formulas, Graphs, and Mathematical Tables, National Bureau of Standards, Washington, D.C., 1972.
15. Spiegel, M.R., Mathematical Handbook of Formulas and Tables, McGraw-Hill, New
York, 1968.
3.1 INTRODUCTION
For large engineering systems, management of design and reliability becomes an
important issue. Thus, this chapter is devoted to engineering design and reliability
management.
In engineering, the term design/designer may convey different meanings to
different people [1]. For example, to some a design is the creation of a sophisticated
and complex item such as a space satellite, computer, or aircraft, and to others a
designer is simply a person who uses drawing tools to draw the details of an item.
Nonetheless, it is the design that distinguishes the engineer from the scientist, and
it is the design that ties engineers to markets and gives vent to the creativity
of the engineering profession.
Imhotep, who constructed the first known Egyptian pyramid at Saqqara in 2650 B.C.,
may be called the first design engineer. As engineering drawings are closely tied
to engineering design, their history could be traced back to 4000 BC when the
Chaldean engineer Gudea engraved upon a stone tablet the plan view of a fortress [2].
The written evidence of the use of technical drawings only goes back to 30 BC
when Vitruvius, a Roman architect, wrote a treatise on architecture. In 1849, in
modern context, an American named William Minifie was the first person to write
a book on engineering drawings entitled Geometrical Drawings.
Today, there are a vast number of publications on both engineering design and
engineering drawings [3]. An engineering design is only good if it is effective and
reliable.
The history of reliability management is much shorter than that of the engineering design and it can only be traced back to the 1950s. Probably the first prime
evidence related to reliability management is the formation of an ad hoc committee
in 1958 for Guided Missile Reliability under the Office of the Assistant Secretary
of Defense for Research and Engineering [4]. In 1959, a reliability program document (exhibit 58-10) was developed by the Ballistic Missile Division of the U.S.
Air Force. The military specification, MIL-R-27542, was the result of Department
of Defense efforts to develop requirements for an organized contractor reliability
program. In 1985, a book entitled Reliability and Maintainability Management was
published [5] and a comprehensive list of publications on the subject is given in
Reference 6.
This chapter discusses engineering design and reliability management separately.
3.2.1 DESIGN FAILURES AND THEIR COMMON REASONS
History has witnessed many engineering design failures; in the recent past, two well-publicized design failures were the Chernobyl nuclear reactor
and the space shuttle Challenger. Some of the other important design failures are as
follows [8, 9]:
In 1937, a high school in Texas converted from burning methane city gas
to less expensive natural gas. Subsequently an explosion occurred and
455 persons died. An investigation of the accident indicated that the area
through which the pipes ran was poorly ventilated. A recommendation
was made for the addition of malodorants to all natural gas so that
any possible leaks could be detected.
In 1963, a U.S. Navy nuclear submarine, the U.S.S. Thresher, slipped
beneath the Atlantic Ocean surface by exceeding its designed maximum
test depth and imploded.
In 1979, a DC-10 airplane lost an engine during a flight and subsequently
crashed. A follow up investigation indicated that the normal engine service
operation was the cause of the problem. More specifically, as the engines
were periodically dismounted, serviced, and remounted, the mounting
holes became elongated during the servicing process and subsequently
led to the disaster.
In 1980, in the North Sea an offshore oil rig named Alexander L. Kielland
broke up under normal weather conditions. A subsequent study of the
disaster revealed that a 3-in. crack in a part close to a weld joint was the
basic cause of the disaster.
In 1988, a Boeing 737-200 lost its cabin roof during a flight. A preliminary
investigation of the disaster indicated that various metal fatigue cracks
were emanating from rivet holes in the aluminum skin.
Over the years many professionals have studied various design failures and
concluded that there could be many different reasons for product failures ranging
from a disaster to simple malfunction. Most of the common reasons are given in
Table 3.1.
TABLE 3.1
Common Reasons for Product Failures

3.2.2 THE DESIGN PROCESS AND FUNCTIONS
The design process is an imaginative integration of scientific information, engineering technology, and marketing for developing a useful and profitable item. The
process may be described in anywhere from 5 to 25 steps; for example, in 6 steps
by Dieter [10], 8 steps by Vidosic [11], and 12 steps by Hill [12]. Nonetheless, for
our purpose we will describe it in six steps:
1. Need recognition. This is concerned with understanding and identifying
the needs of the user.
2. Problem definition. This is concerned with developing a concise problem
statement, identifying problem-related needs and limitations, and obtaining related information.
3. Information gathering. This is concerned with collecting various types
of information from sources including specifications and codes, patent
gazette, Internet, handbooks, technical experts, vendor catalogs, and journals.
4. Conceptualization. The results of conceptualization may take the form
of sketches or free-hand drawings and at this stage, the design process
may be regarded as a creative and innovative activity that generates various
possible alternative solutions to the identified goal. Some of the idea
generation techniques that design professionals can use include group
brainstorming, attribute listing, synectics, and morphological chart.
5. Evaluation. This is concerned with deciding on the best solution out of
all the potential solutions. Some of the evaluations include evaluation
based on feasibility judgment, evaluation based on technology-readiness
assessment, and evaluation based on GO/NO-GO screening.
6. Communication of design. The final solution to the engineering problem
leads to final documents representing the product or the product itself, for
the purpose of communicating design to others. The design documents
include items such as engineering drawings, operation and maintenance
instructions, information on quality assurance, patent applications, and
bills of materials.
There are basically five functions involved in engineering design: (1) research,
(2) engineering, (3) manufacturing, (4) quality assurance, and (5) commercial. The
research function activities include conducting basic and applied research, developing specifications for quality testing procedures, and preparing process specifications
for the testing of highly stressed parts. The engineering functions are the subcomponents of the design activity and include activities such as developing new design
concepts, estimating cost, developing production design, making provisions of maintenance instructions, and analyzing field problems. The manufacturing functions
include activities such as assembly, manufacturing planning, determining tooling
needs, and purchasing materials. The main objective of the quality assurance function
is to assure the quality of the end product. Some of the activities covered by the
quality assurance function are designing quality related methods and procedures,
and setting up design auditing. The commercial function is concerned with the
relationship of various clients and its activities include conducting market surveys
and tendering, advertising, managing contracts, and arranging delivery and payments.
3.2.3 DESIGN TEAM AND MEMBER RESPONSIBILITIES
The design engineer is usually not the only person who develops a design. In real
life, many other individuals also participate. For example, representatives from areas such as manufacturing, field services, and quality control are
involved. The nature of the product dictates the degree of their participation. The
design team members' responsibilities may be grouped into four areas: (1) design,
(2) manufacturing, (3) quality, and (4) field service [7]. The design related responsibilities include initial design concept, functional integrity, prototype build, test
procedures, documentation issue, specifications, cost estimates, materials, performance goals, bill of materials, and coordination with reliability group, safety group,
standards group, planning group, and model shop.
Some of the responsibilities belonging to the manufacturing area are tooling,
model build co-ordination, assembly procedures, test and set up procedures, vendor
involvement, pre-production builds, procurement schedule, and co-ordination with
quality control, product planning, materials management, and industrial engineering.
The quality related responsibilities include vendor training, quality audits, field
reliability, total quality management, quality circles, supplier approval, and product
verification.
The field service related responsibilities are service procedures, service tools,
service training, customer training, spares list, service documentation, and so on.
3.2.4 DESIGN REVIEWS
The design reviews are an important factor during the design phase of an engineering
product. The basic purpose of such reviews is to assure the application of correct
design principles, in addition to determining whether the design effort is progressing
according to specifications, plans, and so on.
Various types of reviews are performed during the design phase of a product
and the total cost of performing such reviews varies between 1 and 2% of the total
engineering cost of a project [12].
Different writers and practitioners categorize design reviews differently. For our
purpose, we have divided them into three major categories: (1) preliminary design
review, (2) intermediate design review, and (3) critical design review [12, 13]. Each
of these three categories is discussed below.
1. Preliminary design review. This design review is usually conducted prior
to the formulation of the initial design. The primary aim of this review is
to evaluate each and every design specification requirement with respect
to accuracy, validity, and completeness. The areas that could be reviewed
during this design review include cost objective, design alternatives,
present/future availability of materials, applicable legislation, potential
users/customers, schedule imposed requirements, required functions, customer needs, and critical parts/components.
2. Intermediate design review. This review is conducted before starting the
detailed production drawings. The main objective of this review is to make
comparison of each specification requirement with the design under development. It is to be noted that prior to starting this review, the design
selection process and preliminary layout drawings are complete.
3. Critical design review. This is also known as the final design review and
is conducted after the completion of production drawings. The emphasis
of this review is on areas such as design producibility, value engineering,
review of analysis results, and so on.
Design Review Team
Professionals from various areas participate in design reviews and their number and
type may vary from one project to another. For an effective performance, the size
of the team should not exceed 12 participants [12]. A typical design review team is
made up of professionals such as design engineer, senior design engineer, design
review board chairman, manufacturing engineer, tooling engineer, procurement engineer, customer representative(s) (if any), test engineer, reliability engineer, materials
engineer, and quality control engineer. Each of these individuals evaluates design
from his/her perspective.
Design Review Related Information and Topics
In order to conduct effective reviews, the design review team members must have
access to various information items, as appropriate, including parts list, specifications
and schematic diagrams, acceleration and shock data, circuit descriptions, results of
reliability prediction studies, list of inputs/outputs, results of failure modes and effect
analysis (FMEA), and vibration and thermal tests data.
Usually during design reviews, topics such as mechanical design (i.e., results of
tests, thermal analysis, balance, etc.), specifications (i.e., adherence to specifications,
correctness of specifications, etc.), human factors (i.e., glare, control and display,
labeling and marking, etc.), reproducibility (i.e., reliance of products on a single
part supplier, economical assembly of product in the production shop, etc.), electrical
design (i.e., design simplification, electrical interference, performance, results of
3.2.5 DESIGN ENGINEER AND
3.2.6
3.2.7 DESIGN LIABILITY
This is a very important factor in product design and can simply be stated as the
liability of designers for injury, loss, or damage caused by defects in products
designed by them. According to Roche [16], approximately 40% of causes of product
liability suits are due to defects in design.
In determining the defectiveness of a product, factors such as the marketing of
product, stated instructions, stated warnings, time of sale, and foreseeable application
and intended use are taken into consideration. In situations where injury results, the
important points (e.g., in the U.K.) affecting design liability include [17]:
The designer cannot be liable when the risk of injury was not reasonably
foreseeable.
The designer cannot be totally liable when the risk of injury was foreseeable to him/her and he/she took reasonable design precautions.
The designer is liable when the risk of injury was reasonably foreseeable
to him/her and was not obvious to the product user but the designer took
no reasonable measures to warn.
The designer cannot be totally liable when the risk of injury was reasonably foreseeable and obvious to the user.
All in all, there is no simple answer as to how to reduce design-related risks,
but the design reviews are an appropriate vehicle for identifying and evaluating
potential safety hazards.
3.3.1
In order to have an effective reliability program, there are various areas in which
the responsibilities of management lie [19]:
Establishing a program to assess, with respect to reliability, the current
performance of company operations and the products it manufactures;
Establishing certain reliability objectives or goals;
Establishing an effective program to fulfill the set reliability goals and eradicate current deficiencies. A truly effective program should be
able to pay back many times the cost of establishing it;
Providing necessary program related authority, funds, manpower, and time
schedule;
Monitoring the program on a regular basis and modifying associated
policies, procedures, organization, and so on, to the most desirable level.
Facts such as the following will be a guiding force for the general management to
have an effective reliability program [20]:
Reliability is an important factor in the management, planning, and design
of an engineering product.
Changes in maintenance, manufacturing, storage and shipping, testing and
usage in the field of the engineering product tend to lower the reliability
of the design.
Planned programs are needed for application in design, manufacturing,
testing and field phases of the engineering product to control reliability.
Reliability is established by the basic design.
It is during the early phases of the design and evaluation testing programs
when high levels of reliability can be achieved most economically.
Improvement in reliability can be through design changes only.
Human error degrades the reliability of the design.
In achieving the desired reliability in a mature engineering product in a
timely manner, deficiency data collection, analysis, and feedback are
important.
3.3.2
For the development of reliability programs, the military specification, MIL-R-27542, presents the following guidelines [21]:
Assign reliability associated goals for system in question.
To obtain maximum inherent equipment reliability, put in as much effort
as possible during the design phase.
Evaluate reliability margins.
Perform specification review.
Perform design and procedure reviews.
Establish an appropriate testing program.
Conduct a review of changes in specification and drawings with respect
to reliability.
Establish and maintain control during production through inspection, sample testing, and effective production control.
Develop a closed-loop system for failure reporting, analysis, and feedback
to engineering specialists for corrective measures to stop re-occurrence.
3.3.3
3.3.4 DOCUMENTS AND TOOLS FOR RELIABILITY MANAGEMENT
Reliability management makes use of various kinds of documents and tools. Some
of the examples of documents used by the reliability management are the in-house
reliability manual; national and international standards, specification or publications;
documents explaining policy and procedures; instructions and plans; reports; and
drawings and contract documents. Similarly, the reliability management makes use
of various kinds of management tools. Some of them are value engineering, configuration management and critical path method or program evaluation and review
technique. Some of these items are discussed in the following.
The reliability manual is the backbone of any reliability organization. Its existence is vital for any organization irrespective of its size. An effective reliability
manual covers topics such as:
Value engineering is concerned with the orderly application of established methods to identify the functions of a product or a service, and also with providing those
identified functions at the minimum cost. Historical evidence indicates that the
application of the value engineering concept has returned between $15 and $30 for each
dollar spent [23].
Hundreds of engineering changes are associated with the development of a
complex system. Configuration management assures both the customer and the
manufacturer that the resulting system hardware and software fully meet the contract
specification. Configuration management is defined as the management of technical
requirements which define the engineering system as well as changes thereto.
Both the critical path method (CPM) and the program evaluation and review
technique (PERT) were developed in the late 1950s to manage engineering projects
effectively. Nowadays, both these methods are used quite frequently. The application
of both these techniques have been found effective in areas such as planning and
smoothing resource utilization, obtaining cost estimates for proposed projects, scheduling projects, and evaluating existing schedules and cost vs. budget, as the work
on the project progresses.
3.3.5 AUDITS AND PITFALLS IN RELIABILITY
The audits are important in reliability work to indicate its strong, weak, and acceptable areas. There are various guidelines in conducting reliability audits. Some of
them are: maintain an auditing schedule, record the audit results, conduct audits without
prior announcement, perform audits using checklists, choose unbiased persons for
auditing, take follow-up actions, and avoid appointing someone permanently to the
auditing role.
Reliability auditing has many advantages. For example, it helps problem areas
to surface, determines if the customer specifications are met satisfactorily, helps to
reduce the complaints received from customers, is useful in predicting the reaction
of the buyer toward reliability, and finds out if the company reliability objectives
are fulfilled.
Pitfalls in reliability program management are the causes for many reliability
program uncertainties and problems. These pitfalls are associated with areas such
as follows [24]:
reliability organization
reliability testing
programming
manufacturing
For example, under pressure to meet a production schedule, the parts buyers or even the design engineers will (under certain circumstances) authorize substitute parts without paying much attention to their effect on reliability. These kinds of pitfalls can only be avoided by an effective reliability management team.
3.3.6
Probably the most crucial factor in the success or failure of reliability and maintainability programs in an organization is the attitude and thinking philosophy of top-level management toward reliability and maintainability. More clearly, without the support of top management, reliability and maintainability programs in a company will not be effective at all. Once top management's positive and effective attitude is established, appropriate reliability and maintainability organizations can be formed. Whether two distinct departments, i.e., reliability and maintainability, are needed will be dictated by the needs of the company in question; sometimes the reliability and maintainability functions may be combined within a single department or assigned as the responsibility of the quality assurance department. There is no iron-clad rule for the reporting structure of the reliability and maintainability departments. However, in order to command respect for these programs, their heads should be placed high enough on the organizational ladder and given the necessary authority [25].
Reliability Engineering Department Responsibilities
A reliability engineering department may have various kinds of responsibilities.
However, the major ones are as follows:
3.3.7 RELIABILITY MANPOWER
The reliability group is composed of people who have specialties in various branches
of reliability engineering. However, according to some experts [21], the personnel
involved in reliability work should have experience and background in areas such
as quality control, probability theory, project management, system engineering,
operations research, environmental testing, components design, manufacturing methods, data analysis and collection, developing specifications, and test planning. Furthermore, it is to be noted that professions such as engineering physics, mechanical
engineering, statistics, metallurgy, and electronics engineering are related to reliability. Their relationships are briefly discussed in Reference 26. Some areas related
to reliability personnel are described below.
Rules for Reliability Professionals
These rules are useful to the reliability manager or engineer for effectively implementing reliability programs in an organization, as well as to reliability professionals seeking to advance in their careers [27]. Some of these rules are presented below.
Speak as briefly as possible. In other words, get to the core of the problem
by presenting a summary and recommendations as soon as possible.
When talking to others, make sure that you converse in a language they
understand. Avoid using terms and abbreviations with which they are
uncomfortable.
Avoid using statistical jargon as much as possible especially when dealing
with top-level management.
Make sure that all reliability functions are included in the reliability
program plan.
Contribute to the solution by getting directly involved with the problem
investigation effort and work out a group of corrective action options.
Learn to be just as comfortable reporting reliability successes as reporting failures.
Avoid, whenever possible, paralysis by analysis.
3.4 PROBLEMS
1. Write an essay on important design failures that have occurred over the
past 50 years.
2. List at least five of the most important common reasons for product
failures.
3. Describe a typical design process practiced in the industrial sector.
4. What are the advantages of conducting design review?
5. Discuss the following design reviews:
Preliminary design review
Intermediate design review
Critical design review
6. Describe the functions of the following two professionals:
Design engineer
Reliability engineer
7. What are the design process aspects that have significant impact on product reliability?
8. List the military specification (i.e., MIL-R-27542) guidelines for developing reliability programs.
9. Discuss the documents used by the reliability management.
10. Briefly discuss the tools used by the reliability management.
11. Describe the reliability program management related pitfalls.
12. What are the typical responsibilities of a reliability engineering department?
13. List the difficulties experienced in motivating design engineers to accept
reliability as one of the design parameters.
3.5 REFERENCES
1. Shigley, J.D. and Mitchell, L.D., Mechanical Engineering Design, McGraw-Hill, New
York, 1983.
2. Farr, M., Design Management, Cambridge University Press, London, 1955.
3. Dhillon, B.S., Engineering Design: A Modern Approach, Richard D. Irwin, Chicago,
IL, 1996.
4. Austin-Davis, W., Reliability management: A challenge, IEEE Trans. Reliability,
R-12, 6-9, 1963.
5. Dhillon, B.S. and Reiche, H., Reliability and Maintainability Management, Van
Nostrand Reinhold Company, New York, 1985.
6. Dhillon, B.S., Reliability and Quality Control: Bibliography on General and Specialized Areas, Beta Publishers, Gloucester, Ontario, 1992.
7. Hurricks, P.L., Handbook of Electromechanical Product Design, Longman Scientific
and Technical, Longman Group UK Limited, London, 1994.
8. Walton, J.W., Engineering Design, West Publishing Company, New York, 1991.
9. Elsayed, E.A., Reliability Engineering, Addison Wesley Longman, Reading, MA,
1996.
10. Dieter, G.E., Engineering Design, McGraw-Hill, New York, 1983.
11. Vidosic, J.P., Elements of Design Engineering, The Ronald Press Co., New York,
1969.
12. Hill, P.H., The Science of Engineering Design, Holt, Rhinehart, and Winston, New
York, 1970.
13. Pecht, M., Ed., Product Reliability, Maintainability, and Supportability Handbook,
CRC Press, Boca Raton, FL, 1995.
14. AMCP 706-196, Engineering Design Handbook, Part II: Design for Reliability, 1976.
Prepared by Headquarters, U.S. Army Material Command, Alexandria, VA.
15. Carter, A.D.S., Mechanical Reliability, MacMillan, London, 1986.
16. Roche, J.G., Design implications of product liability, Int. J. Quality and Reliability
Manage., 2, 1988.
17. Abbot, H., Safer by Design, The Design Council, London, 1987.
18. Dhillon, B.S., Engineering reliability management, IEEE J. Selected Areas Comm.,
4, 1015-1020, 1986.
19. Heyel, C., The Encyclopedia of Management, Van Nostrand Reinhold Company, New
York, 1979.
20. Finch, W.L., Reliability: A technical management challenge, Proc. Am. Soc. Quality
Control Annu. Conf., 851-856, 1981.
21. Karger, D.W. and Murdick, R.G., Managing Engineering and Research, Industrial
Press, New York, 1980.
22. Grant Ireson, W., Coombs, C.F., and Moss, R.Y., Eds., Handbook of Reliability
Engineering and Management, McGraw-Hill, New York, 1996.
23. Demarle, D.J. and Shillito, M.L., Value Engineering, in Handbook of Industrial Engineering, Salvendy, Ed., John Wiley & Sons, New York, 1982, 7.3.1-7.3.20.
24. Thomas, E.F., Pitfalls in reliability program management, Proc. Annu. Reliability
Maintainability Symp., 369-373, 1976.
25. Jennings, J.A., Reliability of aerospace electrical equipment: How much does it cost?,
IEEE Trans. Aerospace, 1, 38-40, 1963.
26. Reliability Engineering: A Career for You, Booklet published jointly by the American
Society for Quality Control (ASQC) and the Institute of Electrical and Electronics
Engineers (IEEE). Available from the Director of Technical Programs, ASQC, 230
West Wells St., Milwaukee, WI 53203.
27. Ekings, J.D., Ten rules for the reliability professional, Proc. Annu. Am. Soc. Quality
Control Conf., 343-351, 1982.
28. McClure, J.Y., Organizing for Reliability and Quality Control, in Reliability Handbook, Ireson, G., Ed., McGraw-Hill, New York, 16-37, 1966.
29. Bajaria, H.J., Motivating design engineers for reliability, Proc. Annu. Conf. Am. Soc.
Quality Control, 767-773, 1979.
Failure Data Collection and Analysis

4.1 INTRODUCTION
Failure data are the backbone of reliability studies because they provide invaluable
information to concerned professionals such as reliability engineers, design engineers, and managers. In fact, the failure data of a product are the final proof of
reliability related efforts expended during its design and manufacture. It would be
impossible to have an effective reliability program without the collection, analysis,
and use of information acquired through the testing and operation of products used
in industrial, military, and consumer sectors.
It may be stated, more specifically, that the basic purpose of developing a formal
system of collecting, analyzing, and retrieving information acquired from past experiences is to allow design and development of better and more reliable products
without repeating the previous efforts concerning research, design, and testing to
achieve the current product reliability. Thus, the fundamental goal of a failure data
collection and analysis system is to convert the relevant information accumulated
in various sources into an effectively organized form so that it can efficiently be
used by individuals with confidence in conducting their assigned reliability related
tasks. There are many different ways to store and present this organized information to end users. As each approach has advantages and disadvantages, careful consideration must be given to their selection.
In particular, the nature of the products and their intended applications are
important factors in deciding the extensiveness, form, etc. of a failure data collection
and analysis system. For example, a company stamping out stainless steel flatware
will require a simple data system as opposed to a complex data system for a company
manufacturing aircraft. All in all, prior to deciding the size, type, etc. of a failure
data system, consider factors such as individuals requiring information, information
required by the individuals to meet their needs, and frequency of their need.
An extensive list of publications on failure data is given in Reference 1. This
chapter presents different aspects of failure data collection and analysis.
4.2
There are many areas for the use of failure data, including conceptual design, preliminary design, test planning, design reviews, manufacturing, quality assurance and inspection planning, logistic support planning, inventory management, procurement, top management budgeting, and field service planning [2]. Nonetheless, some of the more specific uses of failure data are estimating the hazard rate of an item, performing life cycle cost studies, making decisions regarding the introduction of redundancy, predicting an item's reliability/availability, and performing equipment replacement studies. Failure data can be collected from many sources, including the following:
Warranty claims
Previous experience with similar or identical equipment
Repair facility records
Factory acceptance testing
Records generated during the development phase
Customers' failure reporting systems
Tests: field demonstration, environmental qualification, and field installation
Inspection records generated by quality control/manufacturing groups
In order to maintain the required quality standards, the quality control groups
within organizations regularly perform tests/inspections on products/equipment. As
the result of such actions, valuable data are generated that can be very useful in
design reviews, part vendor selection, providing feedback to designers on the problems related to production, and so on. The important components of the quality
control data are incoming inspection and test results, quality audit records and final
test results, in-process quality control data, results of machine and process capability
studies, and calibration records of measuring instruments [2].
TABLE 4.1
Selective Standard Documents for Failure Data Collection

No.   Document classification No.   Developed by
1     MIL-STD-2155                  U.S. Department of Defense
2                                   International Electrotechnical Commission (IEC)
3     MODUK DSTAN 00-44             British Defense Standards (U.K. Department of Defense)
4     IEEE 500                      Institute of Electrical and Electronics Engineers (IEEE)
5     IEC 362                       International Electrotechnical Commission (IEC)
6                                   American Society for Quality Control (ASQC)
TABLE 4.2
Failure Rates for Some Electronic Items

No.   Item description                                                 Failure rate (failures/10^6 h)
1     Neon lamp                                                        0.2
2     Weld connection                                                  0.000015a
3     Fuse (cartridge class H or instrument type)                      0.010a
4     Fiber optic cable (single fiber types only)                      0.1 (per fiber km)
5     Solid state relay (commercial grade)                             0.0551a
6     Vibrator (MIL-V-95): 60-cycle                                    15
7     Terminal block connection                                        0.062a
8     Microwave ferrite devices: isolators and circulators (100 watts) 0.10a
9     Solid state relay (military specification grade)                 0.029a
10    Single fiber optic connector                                     0.1
11    Crimp connection                                                 0.00026a
12    Spring contact connection                                        0.17a
13    Hybrid relay (commercial grade)                                  0.0551a
14    Vibrator (MIL-V-95): 120-cycle                                   20
15    Fuse: current limiter type (aircraft)                            0.01a
4.7.1
The hazard function of a statistical distribution is the basis for this theory. Thus, the
hazard rate is defined by
TABLE 4.3
Failure Rates for Some Mechanical Items

No.   Item description            Failure rate (failures/10^6 h)
1     Pivot                       1
2     Heat exchanger              6.11–244.3
3     Relief valve                0.5–10
4     Conveyor belt (heavy load)  20–140
5     Heavy duty ball bearing     20
6     Piston                      1
7     Washer                      0.5
8     Nut or bolt                 0.02
9     Rotating seal               4.4
10    Pipe                        0.2
11    Spring (torsion)            14.296a
12    Gearbox (reduction)         18.755b
13    Bellow                      5
14    Crankshaft (general)        33.292b
15    Cylinder                    0.1
16    Knob (general)              2.081a
17    Flexible coupling           9.987b
18    Axle (general)              9.539b
19    Hair spring                 1
20    Slip ring (general)         0.667a
TABLE 4.4
Human Error Rates for Some Tasks
(No., error description, and error rate for seven representative tasks)
z(t) = f(t)/R(t) = f(t)/[1 − F(t)]    (4.1)

where
z(t) is the hazard rate (time-dependent failure rate).
R(t) is the item reliability at time t.
f(t) is the failure (or probability) density function.
F(t) is the cumulative distribution function of the times to failure; it is expressed by
F(t) = ∫₀ᵗ f(t) dt    (4.2)

and the cumulative hazard function is

zc(t) = ∫₀ᵗ z(t) dt    (4.3)
Equations (4.1) through (4.3) are used to obtain a straight line expression for
estimating graphically the parameters of failure distributions such as the following
[20, 25]:
Exponential Distribution
This is a single parameter distribution and its probability density function is defined
by
f(t) = λ e^(−λt),   t ≥ 0    (4.4)

where
λ is the parameter known as the constant failure rate.
t is time.

Inserting Equation (4.4) into Equation (4.2) yields

F(t) = 1 − e^(−λt)    (4.5)
FIGURE 4.1 A hypothetical plot of time to failure, t, against cumulative hazard, zc.
By substituting Equations (4.4) and (4.5) into Equation (4.1), we get the constant hazard rate

z(t) = λ    (4.6)

By inserting Equation (4.6) into Equation (4.3), we obtain the following expression for the cumulative hazard function:

zc(t) = λ t    (4.7)
By letting θ = 1/λ, where θ is the mean time to failure, and rearranging Equation (4.7) to express the time to failure t as a function of zc, we get

t(zc) = θ zc    (4.8)

Equation (4.8) is the equation of a straight line passing through the origin. In order to estimate the value of θ, the time to failure t is plotted against the cumulative hazard zc. If the plotted field data points fall roughly on a straight line, then a line is drawn to estimate θ. At zc = 1 on this plot, the corresponding value of t is the estimated value of θ. The value of θ also equals the slope of the straight line. When the plotted data points do not fall approximately along a straight line, the failure data do not belong to the exponential distribution and another distribution should be tried. A hypothetical plot of Equation (4.8) is shown in Figure 4.1.
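The graphical estimation implied by Equation (4.8) is easy to automate. The short Python sketch below is an illustration only: the (zc, t) data pairs are hypothetical, and a least-squares line through the origin replaces the fit made by eye.

# Least-squares estimate of the mean time to failure (theta) from
# Equation (4.8), t = theta * zc, using hypothetical (zc, t) pairs.
zc = [0.10, 0.21, 0.33, 0.48, 0.69, 1.10]        # cumulative hazard values
t = [95.0, 230.0, 310.0, 520.0, 640.0, 1150.0]   # times to failure (h)

# For a line through the origin, the least-squares slope is
# sum(zc*t) / sum(zc*zc); the slope is the estimate of theta.
theta = sum(z * ti for z, ti in zip(zc, t)) / sum(z * z for z in zc)
print("Estimated mean time to failure (h):", round(theta, 1))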
Bathtub Hazard Rate Distribution
This distribution can represent failure data exhibiting increasing, decreasing, and bathtub hazard rates. The probability density function of the distribution is defined by [26, 27]:

f(t) = λ β (λt)^(β−1) exp[(λt)^β] exp{−[e^((λt)^β) − 1]}    (4.9)

where
λ is the scale parameter.
β is the shape parameter.
t is time.

Substituting Equation (4.9) into Equation (4.2) yields

F(t) = 1 − exp{−[e^((λt)^β) − 1]}    (4.10)

Inserting Equations (4.9) and (4.10) into Equation (4.1) gives the hazard rate

z(t) = λ β (λt)^(β−1) e^((λt)^β)    (4.11)

and inserting Equation (4.11) into Equation (4.3) gives the cumulative hazard function

zc(t) = e^((λt)^β) − 1    (4.12)

Rearranging Equation (4.12) and taking natural logarithms of (zc + 1) twice, we get

ln t = (1/β) ln x − ln λ    (4.13)

where

x = ln (zc + 1)

The plot of ln t against ln x gives a straight line; thus, the slope of the line and its intercept are equal to 1/β and (−ln λ), respectively. If the plotted field data points fall roughly on a straight line, then a line can be drawn to estimate the values of λ and β. On the other hand, when the plotted data do not fall approximately along a straight line, the failure data under consideration do not belong to the bathtub hazard rate distribution; thus, another distribution should be tried.
Weibull Distribution
This is a widely used distribution in reliability studies and it can represent a wide
range of failure data. The probability density function of the distribution is defined by
f(t) = (β/θ^β) t^(β−1) exp[−(t/θ)^β],   for t ≥ 0    (4.14)

where
t is time.
θ is the scale parameter.
β is the shape parameter.

Inserting Equation (4.14) into Equation (4.2) yields

F(t) = 1 − exp[−(t/θ)^β]    (4.15)

At β = 1 and β = 2, the Weibull distribution becomes the exponential and Rayleigh distributions, respectively.

By substituting Equations (4.14) and (4.15) into Equation (4.1), we get

z(t) = (β/θ^β) t^(β−1)    (4.16)

Inserting Equation (4.16) into Equation (4.3) gives the cumulative hazard function

zc(t) = (t/θ)^β    (4.17)

In order to express the time to failure t as a function of zc, we rearrange Equation (4.17) as follows:

t(zc) = θ zc^(1/β)    (4.18)

Taking natural logarithms of Equation (4.18) yields

ln t = (1/β) ln zc + ln θ    (4.19)

Thus, the plot of ln t against ln zc gives a straight line whose slope and intercept are equal to 1/β and ln θ, respectively.
TABLE 4.5
Failure and Running (Censoring) Times
for 16 Identical Units

Unit No.   Failure or running time (h)
1          14,092a
2          3,973
3          2,037
4          360
5          628
6          3,279
7          2,965
8          837a
9          79
10         13,890a
11         184
12         13,906a
13         499
14         134
15         10,012
16         3,439a

a Running (censoring) time.
Example 4.1
Assume that 16 identical units, at time t = 0, were put on test and Table 4.5 presents
failure and running times of these units. Determine the statistical distribution fit to
the given data and estimate values for its parameters by using the hazard plotting
steps described below.
Hazard Plotting Steps
The following eight steps are used to construct a hazard plot:
1. Order the data containing m times from smallest to largest without making
any distinction whether these data are running (censoring) or failure times
of units. Use an asterisk (*) or other means to identify the running or
censoring times. In the event of having some running and failure times
equal, mix such times well on the ordered list of smallest to largest times.
For Example 4.1 data given in Table 4.5, in Column 1 of Table 4.6, the
failure and running times of 16 units are ranked from smallest to largest
with the censoring times identified by a superscript a.
2. Label the ordered times with reverse ranks. For example, label the first time with m, the second with (m − 1), the third with (m − 2), …, and the mth with 1.
Column 2 of Table 4.6 presents reverse rank labels for the 16 ordered
times of Column 1.
TABLE 4.6
Ordered Times and Hazard Values

Col. 1                Col. 2         Col. 3                Col. 4
Ordered times t (h)   Reverse rank   Hazard value          Cumulative
(smallest to largest) labels         (100/Col. 2 value)    hazard (zc)
79                    16             6.25                  6.25
134                   15             6.67                  12.92
184                   14             7.14                  20.06
360                   13             7.69                  27.75
499                   12             8.33                  36.08
628                   11             9.09                  45.17
837a                  10
2,037                 9              11.11                 56.28
2,965                 8              12.5                  68.78
3,279                 7              14.29                 83.07
3,439a                6
3,973                 5              20                    103.07
10,012                4              25                    128.07
13,890a               3
13,906a               2
14,092a               1

a Running (censoring) time.
3. Compute a hazard value for each failure using 100/m, 100/(m-1), 100/(m-2),
etc. The hazard value may be described as the conditional probability of
failure time. In other words, it is the observed instantaneous failure rate
at a certain failure time.
For example, as per Columns 1 and 2 of Table 4.6, the first failure
occurred after 79 h out of 16 units put on test at time t = 0. Thus, the
hazard value is 100/16 = 6.25%.
After 79 h, only 15 units were operating and 1 failed after 134 h. Similarly,
the hazard value is 100/15 = 6.67%.
In a similar manner, the hazard values at other failure times were computed
as shown in Column 3 of Table 4.6. These values are simply 100 divided
by the corresponding number of Column 2.
4. Obtain the cumulative hazard value for each failure by adding its hazard
value to the cumulative hazard value of the preceding failure. For example,
for the first failure at 79 h (as shown in Column 1 of Table 4.6), its hazard
value is 6.25% and there is no cumulative hazard value of the preceding
failure because this is the first failure. Thus, in this case, the cumulative
hazard simply is 6.25%. However, for the failure at 2037 h, the cumulative
hazard is 11.11 (hazard value) + 45.17 (cumulative hazard value of the
preceding failure) = 56.28. Similarly, Column 4 of Table 4.6 presents
cumulative hazard values for other failures.
It is to be noted that the cumulative hazard values may exceed 100%
and have no physical interpretation.
5. Choose a statistical distribution and prepare times to failure and corresponding cumulative hazard data for use in the selected distribution to
construct a hazard plot.
For our case, we select the Weibull distribution; thus, we take natural
logarithms of the times to failure and of corresponding cumulative hazard
values given in Table 4.6. The processed data for ln t and ln zc are
presented in Table 4.7.
6. Plot each time to failure against its corresponding cumulative hazard value
on a graph paper. Even though the running times are not plotted, they do
determine the plotting points of the times to failure through the reverse
ranks.
In our case, we plotted ln t against ln zc as shown in Figure 4.2 using
Table 4.7 values.
TABLE 4.7
Processed Data for the Weibull Distribution

ln t                          ln zc
(using values for t           (using values for zc
from Col. 1, Table 4.6)       from Col. 4, Table 4.6)
4.37                          1.83
4.90                          2.56
5.22                          3.00
5.89                          3.32
6.21                          3.59
6.44                          3.81
7.62                          4.03
8.00                          4.23
8.10                          4.42
8.29                          4.64
9.21                          4.85
7. Determine if the plotted data points roughly follow a straight line. If they
do, it is reasonable to conclude that the selected distribution adequately
fits the data and draw a best fit straight line. If they do not, try plotting
the given data for another statistical distribution.
In our case, the given data roughly follow the straight line; thus, it is
reasonable to conclude that the data are Weibull distributed.
8. Estimate the values of the distribution parameters using the hazard plot.
In our case, using Figure 4.2, we have β = 0.58 and θ = 1.82 h.
Finally, it is emphasized that the results obtained using this method are valid only when the censoring (running) times of the unfailed units are statistically independent of the failure times those units would have exhibited had they been run to failure.
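The hazard plotting steps are equally easy to carry out in software. The Python sketch below is a minimal illustration: it recomputes Columns 2 through 4 of Table 4.6 from the Table 4.5 data and then fits the straight line of Equation (4.19) by least squares instead of by eye, so the estimates may differ slightly from those read from Figure 4.2.

import math

# (time, censored?) pairs from Table 4.5; True marks a running (censoring) time.
data = [(14092, True), (3973, False), (2037, False), (360, False),
        (628, False), (3279, False), (2965, False), (837, True),
        (79, False), (13890, True), (184, False), (13906, True),
        (499, False), (134, False), (10012, False), (3439, True)]

data.sort(key=lambda x: x[0])                    # Step 1: order all times
m = len(data)
zc, pts = 0.0, []
for rank, (t, censored) in enumerate(data):
    reverse_rank = m - rank                      # Step 2: reverse rank labels
    if not censored:
        zc += 100.0 / reverse_rank               # Steps 3-4: hazard and cumulative hazard (%)
        pts.append((math.log(zc), math.log(t)))  # Step 5: (ln zc, ln t) pairs

# Steps 6-8: least-squares line ln t = a*ln zc + b, with a = 1/beta and b = ln theta.
n = len(pts)
sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
print("shape estimate beta  =", round(1.0 / a, 2))
print("scale estimate theta =", round(math.exp(b), 2), "h")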
4.8.1
BARTLETT TEST
Often in reliability analysis, it is assumed that an item's times to failure are exponentially distributed. The Bartlett test is a useful method to verify this assumption from a given sample of failure data. The Bartlett test statistic is defined [28] as
Sbm = 12 m^2 [ln F − (W/m)] / (6m + m + 1)    (4.20)

with

W = Σᵢ₌₁ᵐ ln ti    (4.21)

and

F = (1/m) Σᵢ₌₁ᵐ ti    (4.22)
where
m is the total number of times to failure in the sample.
ti is the ith time to failure.
In order for this test to discriminate effectively, a sample of at least 20 times to failure is required. If the times to failure follow the exponential distribution, then
Sbm is distributed as chi-square with (m-1) degrees of freedom. Thus, a two-tailed
chi-square method (criterion) is used [30].
Example 4.2
After testing a sample of identical electronic devices, 25 times to failure as shown
in Table 4.8 were observed. Determine if these data belong to an exponential distribution by applying the Bartlett test.
By inserting the given data into Equation (4.22) we get
F = 3440/25 = 137.6 h
Similarly, using Equation (4.21) and the given data we have
W = 110.6655
TABLE 4.8
Times to Failure (in hours)
of Electronic Devices

5     6     21    18    33
34    47    44    61    65
82    87    110   113   140
143   187   183   260   268
270   280   282   350   351
Inserting the above values and the given data into Equation (4.20) yields

Sbm = 12 (25)^2 [ln 137.6 − (110.6655/25)] / [6(25) + 25 + 1] = 21.21
From Table 11.1, for a two-tailed test with a 98% confidence level, the critical values are

χ^2 (α/2, m − 1) = χ^2 (0.02/2, 25 − 1) = 42.98

where

α = 1 − (confidence level) = 1 − 0.98 = 0.02

and

χ^2 (1 − α/2, m − 1) = χ^2 (1 − 0.02/2, 25 − 1) = 10.85
Since the computed value Sbm = 21.21 lies between these two critical values, there is no reason to contradict the assumption of an exponential distribution.
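The Bartlett test computation can be sketched in a few lines of Python, as shown below. This is an illustration only; it assumes the SciPy library is available for the chi-square critical values, and the function name is simply a convenient label.

import math
from scipy.stats import chi2

def bartlett_exponential_test(times, confidence=0.98):
    """Return the Bartlett statistic of Equation (4.20) and the two chi-square critical values."""
    m = len(times)
    mean_t = sum(times) / m                      # F in Equation (4.22)
    w = sum(math.log(t) for t in times)          # W in Equation (4.21)
    sbm = 12 * m * m * (math.log(mean_t) - w / m) / (6 * m + m + 1)
    alpha = 1 - confidence
    lower = chi2.ppf(alpha / 2, m - 1)
    upper = chi2.ppf(1 - alpha / 2, m - 1)
    return sbm, lower, upper

# Example 4.2 data (Table 4.8)
times = [5, 6, 21, 18, 33, 34, 47, 44, 61, 65, 82, 87, 110, 113, 140,
         143, 187, 183, 260, 268, 270, 280, 282, 350, 351]
sbm, lo, hi = bartlett_exponential_test(times)
print(round(sbm, 2), round(lo, 2), round(hi, 2))
# The exponential assumption is not contradicted when lo < sbm < hi.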
4.8.2
This test is also used to determine whether a given set of data belongs to an exponential distribution. The test requires calculating the value of χ^2, a chi-square variate with 2m degrees of freedom, defined by [28, 29, 32]:

χ^2 = −2 Σᵢ₌₁ᵐ ln [T(ti) / T(t*)]    (4.23)

where
T(t*) is the total operating time at test termination.
T(ti) is the total operating time at the occurrence of the ith failure.
m is the total number of failures in a given sample.
If the value of χ^2 lies within the interval

χ^2 (1 − α/2, 2m) < χ^2 < χ^2 (α/2, 2m)    (4.24)

there is no reason to reject the assumption that the times to failure are exponentially distributed; here α is obtained from Equation (4.25).
TABLE 4.9
Times to Failure (in hours)
of Electrical Devices

Failure No.   Time to failure (h)
1             70
2             85
3             25
4             10
5             20
6             40
7             50
8             45
9             30
10            60
α = 1 − (confidence level)    (4.25)
Example 4.3
Assume that 40 identical electrical devices were tested for 200 h and 10 failed. The
failed devices were not replaced. The failure times of the electrical devices are
presented in Table 4.9. Determine if the times to failure can be represented by the
exponential distribution.
Using the given data, we get

T(t*) = (40 − 10)(200) + 70 + 85 + 25 + 10 + 20 + 40 + 50 + 45 + 30 + 60 = 6435 h
Thus, using the above result and the given data in Equation (4.23), we get

χ^2 = −2 {ln [(10)(40)/6435] + ln [(10 + (20)(39))/6435] + ln [(30 + (25)(38))/6435]
      + ln [(55 + (30)(37))/6435] + ln [(85 + (40)(36))/6435] + ln [(125 + (45)(35))/6435]
      + ln [(170 + (50)(34))/6435] + ln [(220 + (60)(33))/6435] + ln [(280 + (70)(32))/6435]
      + ln [(350 + (85)(31))/6435]}
    = 30.50
For a 98% confidence level, from Equation (4.25) we have α = 1 − 0.98 = 0.02. Using this value and the other given data in relationship (4.24), we get

χ^2 (1 − 0.02/2, 2(10)) = χ^2 (0.99, 20) = 8.26 and χ^2 (0.02/2, 2(10)) = χ^2 (0.01, 20) = 37.57

Since the computed value χ^2 = 30.50 lies between these two critical values, there is no reason to reject the assumption that the times to failure of the electrical devices are exponentially distributed.
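A Python sketch of this chi-square test for unreplaced units on test is given below. It is illustrative only; SciPy is assumed to be available for the critical values, and the function name is an arbitrary label for this sketch.

from math import log
from scipy.stats import chi2

def exponential_ttt_test(failure_times, n_units, test_duration, confidence=0.98):
    """Chi-square statistic of Equation (4.23) plus the two critical values."""
    ts = sorted(failure_times)
    m = len(ts)
    # Total operating time at test termination, T(t*)
    total = (n_units - m) * test_duration + sum(ts)
    stat, cum = 0.0, 0.0
    for i, t in enumerate(ts):
        t_i = cum + t * (n_units - i)        # total operating time at the ith failure
        stat += log(t_i / total)
        cum += t
    stat *= -2.0
    alpha = 1 - confidence
    return stat, chi2.ppf(alpha / 2, 2 * m), chi2.ppf(1 - alpha / 2, 2 * m)

# Example 4.3: 40 devices tested for 200 h, 10 failures (Table 4.9)
print(exponential_ttt_test([70, 85, 25, 10, 20, 40, 50, 45, 30, 60], 40, 200))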
4.8.3 KOLMOGOROV–SMIRNOV TEST
This test compares the hypothesized cumulative distribution function F(ti), evaluated at each of the n ordered failure times ti, with the sample cumulative distribution function given by

F̂(ti) = i/n    (4.24)

and

F̂(ti−1) = (i − 1)/n    (4.25)

The test statistic is the maximum absolute difference

d = max over i of {|F(ti) − F̂(ti)|, |F(ti) − F̂(ti−1)|}    (4.26)
7. Use Table 4.10 [13, 31, 33, 38] to obtain the critical value dα for the specified level of significance α and the sample size n.
8. Compare the value of d with dα. If d is greater than dα, reject the assumed distribution.
Example 4.4
A medical device manufacturer developed a certain medical device and life tested 12
units of the same device. These units failed after 20, 30, 40, 55, 60, 70, 75, 80, 90, 95,
100, and 110 h of operation. After reviewing the past failure pattern of similar devices,
it was assumed that the times to failure follow the normal distribution. Use the Kolmogorov–Smirnov method to test this assumption at the 0.05 level of significance.
For the specified data, the mean of the normal distribution is given by

μ = (1/m) Σᵢ₌₁ᵐ ti = 825/12 = 68.75 h
TABLE 4.10
Kolmogorov–Smirnov Critical Values for dα

Level of significance, α
Sample size, n   α = 0.01      α = 0.05      α = 0.10      α = 0.15
5                0.67          0.57          0.51          0.47
6                0.62          0.52          0.47          0.44
7                0.58          0.49          0.44          0.41
8                0.54          0.46          0.41          0.38
9                0.51          0.43          0.39          0.36
10               0.49          0.41          0.37          0.34
11               0.47          0.39          0.35          0.33
12               0.45          0.38          0.34          0.31
13               0.43          0.36          0.33          0.30
14               0.42          0.35          0.31          0.29
15               0.40          0.34          0.30          0.28
16               0.39          0.33          0.30          0.27
17               0.38          0.32          0.29          0.27
18               0.37          0.31          0.28          0.26
19               0.36          0.30          0.27          0.25
20               0.36          0.29          0.26          0.25
>50              1.63/n^(1/2)  1.36/n^(1/2)  1.22/n^(1/2)  1.14/n^(1/2)
where
μ is the mean.
m is the number of medical devices life tested.
ti is the failure time of device i; for i = 1, 2, 3, …, 12.

Similarly, for the given data, the standard deviation, σ, of the normal distribution is

σ = [Σᵢ₌₁ᵐ (ti − μ)^2 / (m − 1)]^(1/2) = [Σᵢ₌₁¹² (ti − 68.75)^2 / (12 − 1)]^(1/2) = 28.53 h
Using the above values for μ and σ and the cumulative normal distribution function table, the values for F(ti) shown in Table 4.11 were obtained. The values for F̂(ti) and F̂(ti−1) presented in Table 4.12 were obtained using Equations (4.24) and (4.25), respectively.
TABLE 4.11
Values for F(ti)

i     ti    (ti − μ)/σ   F(ti)
1     20    −1.70        0.04
2     30    −1.35        0.08
3     40    −1.00        0.15
4     55    −0.48        0.31
5     60    −0.30        0.38
6     70    0.04         0.51
7     75    0.21         0.58
8     80    0.39         0.65
9     90    0.74         0.77
10    95    0.92         0.82
11    100   1.09         0.86
12    110   1.44         0.92
TABLE 4.12
Computed Values for F̂(ti) and F̂(ti−1)

i     ti    F̂(ti)            F̂(ti−1)
1     20    1/12 = 0.08      (1 − 1)/12 = 0
2     30    2/12 = 0.16      (2 − 1)/12 = 0.08
3     40    3/12 = 0.25      (3 − 1)/12 = 0.16
4     55    4/12 = 0.33      (4 − 1)/12 = 0.25
5     60    5/12 = 0.41      (5 − 1)/12 = 0.33
6     70    6/12 = 0.5       (6 − 1)/12 = 0.41
7     75    7/12 = 0.58      (7 − 1)/12 = 0.5
8     80    8/12 = 0.66      (8 − 1)/12 = 0.58
9     90    9/12 = 0.75      (9 − 1)/12 = 0.66
10    95    10/12 = 0.83     (10 − 1)/12 = 0.75
11    100   11/12 = 0.91     (11 − 1)/12 = 0.83
12    110   12/12 = 1        (12 − 1)/12 = 0.91
Using the values presented in Tables 4.11 and 4.12 for F(ti), F̂(ti), and F̂(ti−1), the absolute values of F(ti) − F̂(ti) and F(ti) − F̂(ti−1) listed in Table 4.13 were obtained. By examining Table 4.13, we obtain the maximum absolute difference of 0.11 (i.e., d = 0.11 in Equation (4.26)).
For the given data (i.e., α = 0.05 and n = 12), Table 4.10 gives d0.05 = 0.38. Thus, we conclude that since d0.05 > d (i.e., 0.38 > 0.11), there is no reason to reject the normal distribution assumption.
TABLE 4.13
Absolute Calculated Values for
|F(ti) − F̂(ti)| and |F(ti) − F̂(ti−1)|

i     ti    |F(ti) − F̂(ti)|   |F(ti) − F̂(ti−1)|
1     20    0.04               0.04
2     30    0.08               0
3     40    0.1                0.01
4     55    0.02               0.06
5     60    0.03               0.05
6     70    0.01               0.1
7     75    0                  0.08
8     80    0.01               0.07
9     90    0.02               0.11 (max.)
10    95    0.01               0.07
11    100   0.05               0.03
12    110   0.08               0.01
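The Kolmogorov–Smirnov computation of Example 4.4 can be reproduced with a short Python sketch such as the one below. It is an illustration only; SciPy's normal distribution function is assumed, and the sample mean and standard deviation are used as in the example.

from statistics import mean, stdev
from scipy.stats import norm

def ks_normal_test(times):
    """Return the K-S statistic d of Equation (4.26) for a fitted normal distribution."""
    ts = sorted(times)
    n = len(ts)
    mu, sigma = mean(ts), stdev(ts)          # sample estimates, as in Example 4.4
    d = 0.0
    for i, t in enumerate(ts, start=1):
        f = norm.cdf((t - mu) / sigma)       # hypothesized CDF, F(t_i)
        d = max(d, abs(f - i / n), abs(f - (i - 1) / n))
    return d

# Example 4.4 data; compare d with the Table 4.10 critical value d_alpha.
times = [20, 30, 40, 55, 60, 70, 75, 80, 90, 95, 100, 110]
print(round(ks_normal_test(times), 2))       # about 0.11, well below d_0.05 = 0.38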
4.9 MAXIMUM LIKELIHOOD ESTIMATION METHOD

For a sample of n independent times to failure t1, t2, …, tn drawn from a distribution with probability density function f(t; θ) and parameter θ, the likelihood function is

L = f(t1; θ) f(t2; θ) … f(tn; θ)    (4.27)

The value of θ that maximizes ln L (or L) is known as the maximum likelihood estimator (MLE) of θ. Usually, θ is estimated by solving the following expression:

∂ ln L / ∂θ = 0    (4.28)

For a large sample size, the variance of the estimator may be approximated by

Var (θ̂) ≈ −[∂^2 ln L / ∂θ^2]^(−1)    (4.29)

where
θ̂ denotes the estimated value of θ.

The application of this method to selected statistical distributions is presented below.
4.9.1 EXPONENTIAL DISTRIBUTION
In this case, the probability density function is

f(t) = λ e^(−λt),   t ≥ 0    (4.30)

where
t is time.
λ is the distribution parameter (constant failure rate).

Substituting Equation (4.30) into Equation (4.27) yields

L = λ^n exp(−λ Σᵢ₌₁ⁿ ti)    (4.31)

Thus,

ln L = n ln λ − λ Σᵢ₌₁ⁿ ti    (4.32)

Differentiating Equation (4.32) with respect to λ and setting the result equal to zero, we get

n/λ − Σᵢ₌₁ⁿ ti = 0    (4.33)

Thus,

λ̂ = n / Σᵢ₌₁ⁿ ti    (4.34)

where
λ̂ is the estimated value of λ.

By differentiating Equation (4.32) with respect to λ twice, we get

∂^2 ln L / ∂λ^2 = −n/λ^2    (4.35)

For a large sample size n, using Equation (4.35) in Equation (4.29) yields [40]

Var (λ̂) = λ^2/n    (4.36)
4.9.2 NORMAL DISTRIBUTION

In this case, the probability density function is

f(t) = [1/(σ (2π)^(1/2))] exp[−(t − μ)^2 / (2σ^2)]    (4.37)

where μ and σ are the distribution parameters (the mean and the standard deviation, respectively).

Substituting Equation (4.37) into Equation (4.27) and taking natural logarithms yields

ln L = −(n/2) ln 2π − (n/2) ln σ^2 − [1/(2σ^2)] Σᵢ₌₁ⁿ (ti − μ)^2    (4.38)

Differentiating Equation (4.38) with respect to μ and equating the resulting expression to zero yields

∂ ln L / ∂μ = (1/σ^2) Σᵢ₌₁ⁿ (ti − μ) = (1/σ^2) (Σᵢ₌₁ⁿ ti − nμ) = 0    (4.39)

Thus,

μ̂ = (1/n) Σᵢ₌₁ⁿ ti    (4.40)

where
μ̂ is the estimated value of μ.

Similarly, differentiating Equation (4.38) with respect to σ^2 and setting the resulting expression equal to zero, we have

(1/σ^2) Σᵢ₌₁ⁿ (ti − μ)^2 − n = 0    (4.41)

Thus,

σ̂^2 = (1/n) Σᵢ₌₁ⁿ (ti − μ̂)^2    (4.42)

where
σ̂^2 is a biased estimator for σ^2.
4.9.3 WEIBULL DISTRIBUTION
In this case, the probability density function is

f(t) = α b t^(b−1) exp(−α t^b),   for t ≥ 0    (4.43)

where
α is the scale parameter.
b is the shape parameter.
t is time.

By inserting Equation (4.43) into Equation (4.27) and then taking natural logarithms, we get

ln L = n ln α + n ln b − α Σᵢ₌₁ⁿ ti^b + (b − 1) Σᵢ₌₁ⁿ ln ti    (4.44)

Differentiating Equation (4.44) with respect to α and b, respectively, and setting the resulting expressions equal to zero yields

∂ ln L / ∂α = n/α − Σᵢ₌₁ⁿ ti^b = 0    (4.45)

and

∂ ln L / ∂b = n/b + Σᵢ₌₁ⁿ ln ti − α Σᵢ₌₁ⁿ ti^b ln ti = 0    (4.46)

Thus, from Equations (4.45) and (4.46), we get

α̂ = n / Σᵢ₌₁ⁿ ti^b̂    (4.47)

and

b̂ = n / (α̂ Σᵢ₌₁ⁿ ti^b̂ ln ti − Σᵢ₌₁ⁿ ln ti)    (4.48)

where
α̂ is the estimated value of α.
b̂ is the estimated value of b.

Equations (4.47) and (4.48) can be solved by using an iterative process [39].
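One simple way to carry out that iteration is sketched in Python below. This is an illustration only: it assumes complete, uncensored data, uses a damped fixed-point update chosen here for stability, and the failure times shown are hypothetical; production code would normally rely on a library optimizer.

import math

def weibull_mle(times, iterations=200):
    """Fixed-point iteration on Equations (4.47) and (4.48) for alpha and b."""
    n = len(times)
    sum_ln = sum(math.log(t) for t in times)
    b = 1.0                                       # initial guess for the shape parameter
    for _ in range(iterations):
        s1 = sum(t ** b for t in times)           # sum of t_i^b
        s2 = sum((t ** b) * math.log(t) for t in times)
        alpha = n / s1                            # Equation (4.47)
        b_new = n / (alpha * s2 - sum_ln)         # Equation (4.48)
        b = 0.5 * (b + b_new)                     # damped update for stability
    return alpha, b

times = [105.0, 208.0, 320.0, 455.0, 610.0, 800.0, 1100.0]   # hypothetical failure times (h)
alpha_hat, b_hat = weibull_mle(times)
print(round(alpha_hat, 6), round(b_hat, 3))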
4.10 PROBLEMS
TABLE 4.14
Failure and Censoring Times for 20 Identical
Electronic Devices

Device No.   Time (h)
1            200
2            600
3            1000a
4            1500
5            2000
6            400
7            600a
8            2500
9            3000
10           2600
11           1300a
12           1400
13           900
14           800a
15           500
16           600
17           1800a
18           1000
19           700
20           1600

a Censoring time.
TABLE 4.15
Failure Times (hours) of Electrical Parts

10    12    15    13    20
25    37    33    55    60
70    75    95    98    110
115   140   145   180   200
TABLE 4.16
Parts Failure Times

Failure No.   Failure time (h)
1             40
2             100
3             25
4             30
5             110
6             140
7             80
8             150
9. A set of 35 identical parts was tested for 300 h and 8 failed. The failed parts were not replaced and their times to failure are given in Table 4.16. Determine if these failure times can be represented by the exponential distribution.
10. Assume that a sample of 10 identical electrical parts was life tested and the parts failed after 50, 60, 70, 90, 100, 115, 120, 125, 130, and 140 h. After reviewing the failure pattern of similar parts, it was assumed that the times to failure are normally distributed. Use the Kolmogorov–Smirnov method to test this assumption at the 0.1 level of significance.
11. The probability density function of the gamma distribution is defined by

f(t; λ, α) = λ (λt)^(α−1) e^(−λt) / Γ(α),   for t > 0    (4.49)

where
t is time.
λ is the scale parameter.
α is the shape parameter.
Γ(α) is the gamma function. For positive integer α, Γ(α) = (α − 1)!.
4.11 REFERENCES
1. Dhillon, B.S. and Viswanath, H.C., Bibliography of literature on failure data, Microelectronics and Reliability, 30, 723-750, 1990.
2. Grant Ireson, W., Coombs, C.F., and Moss, R.Y., Handbook of Reliability Engineering
and Management, McGraw-Hill, New York, 1996.
3. Kletz, T., The uses, availability and pitfalls of data on reliability, Process Technology,
18, 111-113, 1973.
4. Mitchell, R.L. and Rutter, R.R., A study of automotive reliability and associated cost
of maintenance in the U.S.A., Soc. Automotive Eng. (SAE) Paper No. 780277, Jan.
1978.
5. Dhillon, B.S., Mechanical Reliability: Theory, Models, and Applications, American
Institute of Aeronautics and Astronautics, Washington, D.C., 1988.
6. Hahn, R.F., Data collection techniques, Proc. Annu. Reliability Maintainability Symp.,
38-43, 1972.
7. Parascos, E.T., A new approach to the establishment and maintenance of equipment
failure rate data bases, in Failure Prevention and Reliability, Bennett, S.B., Ross,
A.L., and Zemanick, P.Z., Eds., American Society of Mechanical Engineers, New
York, 1977, 263-268.
8. A Reliability Guide to Failure Reporting, Analysis, and Corrective Action Systems,
Committee on Reliability Reporting, American Society for Quality Control (ASQC),
Milwaukee, 1977.
9. Dhillon, B.S., Advanced Design Concepts for Engineers, Technomic Publishing Company, Lancaster, PA, 1998.
10. MIL-STD-1556, Government/Industry Data Exchange Program (GIDEP), Department of Defense, Washington, D.C.
11. MIL-HDBK-217F (Notice 2), Reliability Prediction of Electronic Equipment, Department of Defense, Washington, D.C., 1995.
12. TD-84-3, Reliability and Maintainability Data for Industrial Plants, A.P. Harris and
Associates, Ottawa, Canada, 1984.
13. Schafer, R.E., Angus, J.E., Finkelstein, J.M., Yerasi, M., and Fulton, D.W., RADC
Nonelectronic Reliability Notebook, Reliability Analysis Center, Rome Air Development Center, Griffiss Air Force Base, Rome, New York, 1985. Report No. RADC-TR-85-194.
14. Rossi, M.J., Nonelectronic Parts Reliability Data, Reliability Analysis Center, Rome
Air Development Center, Griffiss Air Force Base, Rome, New York, 1985. Report
No. NPRD-3.
15. Green, A.E. and Bourne, A.J., Reliability Technology, John Wiley & Sons, Chichester,
England, 1972.
16. Dhillon, B.S., Human Reliability: With Human Factors, Pergamon Press, New York,
1986.
17. Joos, D.W., Sabri, Z.A., and Husseiny, A.A., Analysis of gross error rates in operation
of commercial nuclear power stations, Nuclear Eng. Design, 52, 265-300, 1979.
18. Peters, G.A., Human error: analysis and control, Am. Soc. Safety Eng. J., XI, 9-15,
1966.
19. Recht, J.L., Systems safety analysis: error rates and costs, National Safety News,
February 1966, pp. 20-23.
20. Dhillon, B.S., Reliability Engineering in Systems Design and Operation, Van Nostrand Reinhold Company, New York, 1983.
21. Nelson, W., Hazard plot analysis of incomplete failure data, Proc. Annu. Symp.
Reliability, 391-403, 1969.
22. Nelson, W., Theory and applications of hazard plotting for censored failure data,
Technometrics, 14, 945-966, 1972.
23. Nelson, W., Hazard plotting for incomplete failure data, J. Quality Technol., 1, 27-52,
1969.
24. Nelson, W., Life data analysis by hazard plotting, Evaluation Eng., 9, 37-40, 1970.
25. Nelson, W., Applied Life Data Analysis, John Wiley & Sons, New York, 1982.
26. Dhillon, B.S., A hazard rate model, IEEE Trans. Reliability, 28, 150, 1979.
27. Dhillon, B.S., Life distributions, IEEE Trans. Reliability, 30, 457-460, 1981.
28. Epstein, B., Tests for the validity of the assumption that the underlying distribution
of life is exponential, Technometrics, 2, 83-101, 1960.
29. Epstein, B., Tests for the validity of the assumption that the underlying distribution
of life is exponential, Technometrics, 2, 327-335, 1960.
30. Lamberson, L.R., An evaluation and comparison of some tests for the validity of the
assumption that the underlying distribution of life is exponential, AIIE Trans., 12,
327-335, 1974.
31. Massey, F., The Kolmogorov-Smirnov test for goodness of fit, J. Am. Stat. Assoc.,
46, 70, 1951.
32. Dhillon, B.S., Quality Control, Reliability, and Engineering Design, Marcel Dekker,
New York, 1985.
33. AMC Pamphlet 706-198, Development Guide for Reliability, Part IV, U.S. Army
Materiel Command, Department of Defense, Washington, D.C., 1976.
34. Reliability and Maintainability Handbook for the U.S. Weather Bureau, Publication
No. 530-01-1-762, ARINC Research Corporation, Annapolis, MD, April 1967.
35. Ehrenfeld, S. and Mauer, S.B.L., Introduction to Statistical Method, McGraw-Hill,
New York, 1964.
36. Klerer, M. and Korn, G.A., Digital Computer Users Handbook, McGraw-Hill, New
York, 1967.
37. Conover, W.J., Practical Nonparametric Statistics, John Wiley & Sons, New York,
1971.
38. AMCP 702-3, Quality Assurance Handbook, U.S. Army Material Command, Washington, D.C., 1968.
39. Lloyd, M. and Lipow, M., Reliability: Management, Methods, and Mathematics,
Prentice-Hall, Englewood Cliffs, NJ, 1962.
40. Shooman, M.L., Probabilistic Reliability: An Engineering Approach, McGraw-Hill,
New York, 1968.
41. Mann, N., Schafer, R.E., and Singpurwalla, N.D., Methods for Statistical Analysis of
Reliability and Life Data, John Wiley & Sons, New York, 1974.
Basic Reliability Evaluation and Allocation Techniques
5.1 INTRODUCTION
As engineering systems become more complex and sophisticated, the importance
of reliability evaluation and allocation techniques during their design is increasing.
Usually, in the design of engineering systems, reliability requirements are specified.
These requirements could be in the form of system reliability, failure rate, mean
time between failures (MTBF), and availability. Normally, in order to determine the
fulfillment of such requirements, various reliability evaluation and allocation techniques are employed.
Over the years, many reliability evaluation techniques have been developed but
their effectiveness and advantages may vary quite considerably. Some of these methods
are known as block diagram, decomposition, delta-star, and Markov modeling [1, 2].
Reliability allocation may be described as the top-down process used to subdivide
a system reliability requirement or goal into subsystem and component requirements
or goals. Its basic objectives are to translate the system reliability requirement into more
manageable, lower level (subsystem and component) requirements and to establish an
individual reliability requirement for each subsystem/hardware designer or supplier.
There are many methods and techniques available to perform reliability allocation [3].
This chapter not only describes basic reliability evaluation and allocation techniques, but also associated areas because a clear understanding of these areas is
considered essential prior to learning the evaluation and allocation methods.
During the useful life period, the hazard rate remains constant and there are
various reasons for the occurrence of failures in this region: undetectable defects,
low safety factors, higher random stress than expected, abuse, human errors, natural
failures, explainable causes, etc.
During the wear out period, the hazard rate increases and the causes for the
wear out region failures include wear due to aging, corrosion and creep, short
designed-in life of the item under consideration, poor maintenance, wear due to
friction, and incorrect overhaul practices.
The following hazard rate function can be used to represent the entire bathtub
hazard rate curve [5]:
λ(t) = k λ b t^(b−1) + (1 − k) c θ t^(c−1) e^(θ t^c)    (5.1)

for
c, b, λ, θ > 0
0 ≤ k ≤ 1
b = 0.5
c = 1
t ≥ 0

where
t is time.
λ(t) is the hazard rate.
b and c are the shape parameters.
λ and θ are the scale parameters.
5.3.1
This is defined by
f(t) = −dR(t)/dt    (5.2)
where
R(t) is the item reliability at time t.
f(t) is the failure (or probability) density function.
5.3.2
This is expressed by

λ(t) = f(t)/R(t)    (5.3)

where
λ(t) is the item hazard rate or time-dependent failure rate.

Substituting Equation (5.2) into Equation (5.3) yields

λ(t) = −[1/R(t)] [dR(t)/dt]    (5.4)

5.3.3

Rearranging Equation (5.4), we get

−λ(t) dt = [1/R(t)] dR(t)    (5.5)

Integrating both sides of Equation (5.5) over the time interval [0, t], we get

−∫₀ᵗ λ(t) dt = ∫₁^R(t) [1/R(t)] dR(t)    (5.6)

since at t = 0, R(t) = 1. Evaluating the right-hand side of Equation (5.6) yields

ln R(t) = −∫₀ᵗ λ(t) dt    (5.7)

Thus, from Equation (5.7), we get

R(t) = exp[−∫₀ᵗ λ(t) dt]    (5.8)
The above equation is the general expression for the reliability function. Thus, it can
be used to obtain reliability of an item when its times to failure follow any known
statistical distribution, for example, exponential, Rayleigh, Weibull, and gamma.
5.3.4 MEAN TIME TO FAILURE

This can be obtained by using any of the following three formulas [6]:

MTTF = E(t) = ∫₀^∞ t f(t) dt    (5.9)

or

MTTF = ∫₀^∞ R(t) dt    (5.10)

or

MTTF = lim (s → 0) R(s)    (5.11)

where
MTTF is the mean time to failure.
E(t) is the expected value of the time to failure t.
s is the Laplace transform variable.
R(s) is the Laplace transform of the reliability function R(t).
Example 5.1
Assume that the failure rate of a microprocessor, λ, is constant. Obtain expressions for the microprocessor reliability and mean time to failure and, using the reliability function, prove that the microprocessor failure rate is constant.
Thus, substituting λ(t) = λ into Equation (5.8) yields

R(t) = exp(−∫₀ᵗ λ dt) = e^(−λt)    (5.12)

By inserting Equation (5.12) into Equation (5.10), we get

MTTF = ∫₀^∞ e^(−λt) dt = 1/λ    (5.13)

Since f(t) = −dR(t)/dt = λ e^(−λt), using Equation (5.3) the microprocessor hazard rate is

λ(t) = f(t)/R(t) = [λ e^(−λt)] / e^(−λt) = λ    (5.14)
Thus, Equations (5.12) and (5.13) represent expressions for microprocessor reliability and mean time to failure, respectively, and Equation (5.14) proves that the
microprocessor failure rate is constant.
Example 5.2
Using Equation (5.12), prove that the result obtained by utilizing Equation (5.11) is
the same as the one given by Equation (5.13).
Thus, the Laplace transform of Equation (5.12) is

R(s) = 1/(s + λ)    (5.15)

Substituting Equation (5.15) into Equation (5.11) yields

MTTF = lim (s → 0) [1/(s + λ)] = 1/λ    (5.16)
The results of Equations (5.13) and (5.16) are identical; thus, Equations (5.10) and (5.11) give the same result.
Example 5.3
Assume that the failure rate of an automobile is 0.0004 failures/h. Calculate the
automobile reliability for a 15-h mission and mean time to failure.
Substituting the given data into Equation (5.12) yields

R(15) = e^(−(0.0004)(15)) = 0.994

Similarly, using Equation (5.13), we get

MTTF = ∫₀^∞ e^(−(0.0004)t) dt = 1/(0.0004) = 2,500 h
Thus, the reliability and mean time to failure of the automobile are 0.994 and
2,500 h, respectively.
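For the constant failure rate case, Equations (5.12) and (5.13) reduce to one-line computations; the Python sketch below, given purely as an illustration, reproduces Example 5.3.

import math

failure_rate = 0.0004          # failures per hour
mission_time = 15.0            # hours

reliability = math.exp(-failure_rate * mission_time)   # Equation (5.12)
mttf = 1.0 / failure_rate                               # Equation (5.13)
print(round(reliability, 3), round(mttf, 1))            # 0.994 and 2500.0 h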
5.4.1 SERIES NETWORK
This is the simplest reliability network and its block diagram is shown in Figure 5.2.
Each block in the figure represents a unit/component. If any one of the units fails,
the series system fails. In other words, all the units must work normally for the
success of the series system.
If we let Ei denote the event that the ith unit is successful, then the reliability of the series system is expressed by

RS = P(E1 E2 E3 … En)    (5.17)

where
RS is the series system reliability.
P(E1 E2 E3 … En) is the occurrence probability of events E1, E2, E3, …, and En.

For independently failing units, Equation (5.17) becomes

RS = P(E1) P(E2) P(E3) … P(En)    (5.18)

where
P(Ei) is the probability of occurrence of event Ei; for i = 1, 2, 3, …, n.

If we let Ri = P(Ei), for i = 1, 2, 3, …, n, Equation (5.18) becomes

RS = R1 R2 R3 … Rn = Πᵢ₌₁ⁿ Ri    (5.19)
where
Ri is the unit reliability; for i = 1, 2, 3, , n.
Since normally the value of Ri is between zero and one, the series system reliability
decreases with the increasing value of n.
For a constant failure rate λi of unit i, the reliability of unit i is given by

Ri(t) = e^(−λi t)    (5.20)

where
Ri(t) is the reliability of unit i at time t.

Thus, inserting Equation (5.20) into Equation (5.19) yields

RS(t) = exp(−t Σᵢ₌₁ⁿ λi)    (5.21)
where
Rs (t) is the series system reliability at time t.
By substituting the above equation into Equation (5.10), we get the following expression for the series system mean time to failure:

MTTFS = ∫₀^∞ exp(−t Σᵢ₌₁ⁿ λi) dt = 1 / Σᵢ₌₁ⁿ λi    (5.22)
Using Equation (5.21) in Equation (5.4) yields the following expression for the series system hazard rate:

λS(t) = −[1 / exp(−t Σᵢ₌₁ⁿ λi)] [−(Σᵢ₌₁ⁿ λi) exp(−t Σᵢ₌₁ⁿ λi)] = Σᵢ₌₁ⁿ λi    (5.23)
5.4.2 PARALLEL NETWORK
In this case all the n units operate simultaneously and at least one such unit must
work normally for the system success. The n-unit parallel system block diagram is
shown in Figure 5.3. Each block in the figure denotes a unit/component.
If we let Ei denote the event that the ith unit is unsuccessful, then the failure probability of the parallel system is expressed by

Fp = P(E1 E2 E3 … En)    (5.24)

where
Fp is the parallel system failure probability.
P(E1 E2 E3 … En) is the occurrence probability of failure events E1, E2, E3, …, En.

For independent units, Equation (5.24) is written as

Fp = P(E1) P(E2) P(E3) … P(En)    (5.25)

where
P(Ei) is the probability of occurrence of failure event Ei; for i = 1, 2, 3, …, n.

If we let Fi = P(Ei), for i = 1, 2, 3, …, n, Equation (5.25) becomes

Fp = F1 F2 F3 … Fn    (5.26)

where
Fi is the failure probability of unit i; for i = 1, 2, 3, …, n.

Subtracting Equation (5.26) from unity, we get

Rp = 1 − Fp = 1 − Πᵢ₌₁ⁿ Fi    (5.27)

where
Rp is the parallel system reliability.
For constant failure rates λi of the units, subtracting Equation (5.20) from unity and then substituting it into Equation (5.27) yields

Rp(t) = 1 − Πᵢ₌₁ⁿ (1 − e^(−λi t))    (5.28)
where
Rp (t) is the parallel system reliability at time t.
For identical units, integrating Equation (5.28) over the time interval [0, ∞] yields the following formula for the parallel system mean time to failure:

MTTFp = ∫₀^∞ [1 − (1 − e^(−λt))^n] dt = (1/λ) Σᵢ₌₁ⁿ (1/i)    (5.29)
where
MTTFp is the parallel system mean time to failure.
Subtracting the engine probability of success from unity, we get the following
engine failure probability:
Fe = 1 0.95
= 0.05
Inserting the above value and the other given data into Equation (5.27) yields

Rp = 1 − (0.05)(0.05) = 0.9975

Thus, the aircraft's reliability with respect to engines is 0.9975.
Example 5.6
A system is composed of two independent units in parallel. The failure rates of units
A and B are 0.002 failures per hour and 0.004 failures per hour, respectively.
Calculate the system reliability for a 50-h mission and mean time to failure.
Let λA be the failure rate of unit A and λB the failure rate of unit B. Thus, for n = 2, using Equation (5.28) we get

Rp(t) = e^(−λA t) + e^(−λB t) − e^(−(λA + λB) t)    (5.30)

For t = 50 h, λA = 0.002 failures/h, and λB = 0.004 failures/h, Equation (5.30) yields

Rp(50) = e^(−0.1) + e^(−0.2) − e^(−0.3) = 0.9827

Substituting Equation (5.30) into Equation (5.10), we get

MTTF = ∫₀^∞ [e^(−λA t) + e^(−λB t) − e^(−(λA + λB) t)] dt
     = 1/λA + 1/λB − 1/(λA + λB)
     = 1/0.002 + 1/0.004 − 1/(0.002 + 0.004)
     = 583.33 h
Thus, the system reliability and mean time to failure are 0.9827 and 583.33 h,
respectively.
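A small Python sketch of Equation (5.28), shown below purely for illustration, reproduces the Example 5.6 results; the closed-form mean time to failure expression used in the example is applied for the two-unit case.

import math

def parallel_reliability(failure_rates, t):
    """Equation (5.28): reliability of n independent units in parallel at time t."""
    product = 1.0
    for lam in failure_rates:
        product *= (1.0 - math.exp(-lam * t))
    return 1.0 - product

# Example 5.6: units A and B in parallel
rates = [0.002, 0.004]
print(round(parallel_reliability(rates, 50), 4))        # about 0.9827

# Mean time to failure for two units (closed form used in Example 5.6)
mttf = 1 / rates[0] + 1 / rates[1] - 1 / (rates[0] + rates[1])
print(round(mttf, 2))                                    # 583.33 h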
5.4.3 r-OUT-OF-n NETWORK
This is another form of redundancy in which at least r units out of a total of n units
must work normally for the system success. Furthermore, all the units in the system
are active. The parallel and series networks are special cases of this network for
r = 1 and r = n, respectively.
Using the binomial distribution, for independent and identical units, the reliability of the r-out-of-n network is given by

Rr/n = Σᵢ₌ᵣⁿ C(n, i) R^i (1 − R)^(n−i)    (5.31)

where

C(n, i) = n! / [(n − i)! i!]    (5.32)

and R is the unit reliability. For a constant unit failure rate λ, Equation (5.31) becomes

Rr/n(t) = Σᵢ₌ᵣⁿ C(n, i) e^(−iλt) (1 − e^(−λt))^(n−i)    (5.33)

Substituting Equation (5.33) into Equation (5.10) yields

MTTFr/n = ∫₀^∞ [Σᵢ₌ᵣⁿ C(n, i) e^(−iλt) (1 − e^(−λt))^(n−i)] dt = (1/λ) Σᵢ₌ᵣⁿ (1/i)    (5.34)
where
MTTFr/n is the mean time to failure of the r-out-of-n system.
Example 5.7
A computer system has three independent and identical units in parallel. At least
two units must work normally for the system success. Calculate the computer system
mean time to failure, if the unit failure rate is 0.0004 failures per hour.
Substituting the specified data into Equation (5.34) yields

MTTF2/3 = (1/0.0004) Σᵢ₌₂³ (1/i) = (1/0.0004)(1/2 + 1/3) = 2083.33 h
Thus, the computer system mean time to failure is 2083.33 h.
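Equations (5.33) and (5.34) can be evaluated directly, as in the illustrative Python sketch below; the function names are arbitrary labels for this sketch.

import math

def k_out_of_n_reliability(r, n, lam, t):
    """Equation (5.33): reliability of an r-out-of-n system of identical units."""
    unit_r = math.exp(-lam * t)
    return sum(math.comb(n, i) * unit_r**i * (1 - unit_r)**(n - i)
               for i in range(r, n + 1))

def k_out_of_n_mttf(r, n, lam):
    """Equation (5.34): mean time to failure of an r-out-of-n system."""
    return sum(1.0 / i for i in range(r, n + 1)) / lam

# Example 5.7: 2-out-of-3 computer system with unit failure rate 0.0004 failures/h
print(round(k_out_of_n_mttf(2, 3, 0.0004), 2))            # 2083.33 h
print(round(k_out_of_n_reliability(2, 3, 0.0004, 100), 4))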
5.4.4 STANDBY REDUNDANCY
This is another type of redundancy used to improve system reliability. In this case,
one unit operates and m units are kept in their standby mode. The total system
contains (m + 1) units and as soon as the operating unit fails, the switching mechanism detects the failure and turns on one of the standby units. The system fails
when all the standbys fail. For independent and identical units, perfect switching
and standby units, and unit time dependent failure rate, the standby system reliability
is given by
Rsd(t) = Σᵢ₌₀ᵐ {[∫₀ᵗ λ(t) dt]^i exp[−∫₀ᵗ λ(t) dt]} / i!    (5.35)

where
Rsd(t) is the standby system reliability.
m is the number of standby units.

For a constant unit failure rate λ, Equation (5.35) becomes

Rsd(t) = Σᵢ₌₀ᵐ (λt)^i e^(−λt) / i!    (5.36)

Substituting Equation (5.36) into Equation (5.10) yields

MTTFsd = ∫₀^∞ [Σᵢ₌₀ᵐ (λt)^i e^(−λt) / i!] dt = (m + 1)/λ    (5.37)
where
MTTFsd is the standby system mean time to failure.
Example 5.8
Assume that a standby system has two independent and identical units: one operating,
another on standby. The unit failure rate is 0.006 failures per hour. Calculate the
system reliability for a 200-h mission and mean time to failure, if the switching
mechanism never fails and the standby unit remains as good as new in its standby
mode.
Substituting the specified data into Equation (5.36) yields

Rsd(200) = Σᵢ₌₀¹ [(0.006)(200)]^i e^(−(0.006)(200)) / i! = 0.6626

Similarly, using the given data in Equation (5.37), we get

MTTFsd = 2/(0.006) = 333.33 h
Thus, the standby system reliability and mean time to failure are 0.6626 and 333.33 h, respectively.
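A Python sketch of Equations (5.36) and (5.37), given below for illustration only, reproduces the Example 5.8 results.

import math

def standby_reliability(m, lam, t):
    """Equation (5.36): one operating unit plus m identical standbys,
    perfect switching, constant unit failure rate lam."""
    x = lam * t
    return sum((x ** i) * math.exp(-x) / math.factorial(i) for i in range(m + 1))

def standby_mttf(m, lam):
    """Equation (5.37): standby system mean time to failure."""
    return (m + 1) / lam

# Example 5.8: one standby unit, failure rate 0.006 failures/h, 200-h mission
print(round(standby_reliability(1, 0.006, 200), 4))   # 0.6626
print(round(standby_mttf(1, 0.006), 2))               # 333.33 h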
5.4.5 BRIDGE NETWORK
Rbr = 2 R1 R2 R3 R4 R5 + R2 R3 R4 + R1 R3 R5 + R1 R4 + R2 R5
      − R2 R3 R4 R5 − R1 R2 R3 R4 − R1 R2 R3 R5 − R1 R3 R4 R5 − R1 R2 R4 R5    (5.38)

where
Rbr is the bridge network reliability.
Ri is the ith unit reliability; for i = 1, 2, 3, …, 5.
For identical units, Equation (5.38) simplifies to

Rbr = 2R^5 − 5R^4 + 2R^3 + 2R^2    (5.39)
For a constant unit failure rate λ (i.e., R = e^(−λt)), Equation (5.39) becomes

Rbr(t) = 2e^(−5λt) − 5e^(−4λt) + 2e^(−3λt) + 2e^(−2λt)    (5.40)

where λ is the unit failure rate. By integrating Equation (5.40) over the time interval [0, ∞], we get the following formula for the bridge network mean time to failure:

MTTFbr = 49/(60λ)    (5.41)
where
MTTFbr is the bridge network mean time to failure.
Example 5.9
Assume that five independent and identical units form a bridge configuration. The
failure rate of each unit is 0.0002 failures per hour. Calculate the configuration
reliability for a 500-h mission.
Substituting the given values into Equation (5.40) yields

Rbr(500) = 2e^(−5(0.0002)(500)) − 5e^(−4(0.0002)(500)) + 2e^(−3(0.0002)(500)) + 2e^(−2(0.0002)(500)) = 0.9806
Thus, the bridge configuration reliability is 0.9806.
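Equations (5.39) through (5.41) are equally simple to evaluate in software; the illustrative Python sketch below reproduces Example 5.9.

import math

def bridge_reliability_identical(r):
    """Equation (5.39): bridge network of five identical, independent units."""
    return 2*r**5 - 5*r**4 + 2*r**3 + 2*r**2

def bridge_reliability_constant_rate(lam, t):
    """Equation (5.40): same network with constant unit failure rate lam."""
    return bridge_reliability_identical(math.exp(-lam * t))

# Example 5.9: unit failure rate 0.0002 failures/h, 500-h mission
print(round(bridge_reliability_constant_rate(0.0002, 500), 4))   # 0.9806
# Equation (5.41): bridge network mean time to failure
print(round(49 / (60 * 0.0002), 2))                               # 4083.33 h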
5.5.1
This is probably the simplest approach to determine the reliability of systems made
up of independent series and parallel subsystems. However, the subsystems forming
bridge configurations can also be handled by first applying the delta-star method [10].
Nonetheless, the network reduction approach sequentially reduces the series and
parallel subsystems to equivalent hypothetical single units until the entire system
under consideration itself becomes a single hypothetical unit. The following example
demonstrates this method.
Example 5.10
An independent unit network representing an engineering system is shown in
Figure 5.5 (i). The reliability Ri of unit i; for i = 1, 2, 3, , 6 is specified. Calculate
the network reliability by using the network reduction approach.
First we have identified subsystems A, B, C, and D of the network as shown in
Figure 5.5 (i). The subsystems A and B have their units in series; thus, we reduce
them to single hypothetical units as follows:
R A = R1 R 3 = (0.4) (0.6) = 0.24
and
R B = R 2 R 4 = (0.5) (0.8) = 0.40
where
RA is the reliability of subsystem A.
RB is the reliability of subsystem B.
The reduced network is shown in Figure 5.5 (ii). Now, this network is composed of
two parallel subsystems C and D. Thus, we reduce both subsystems to single
hypothetical units as follows:
RC = 1 − (1 − RA)(1 − RB) = 1 − (1 − 0.24)(1 − 0.4) = 0.5440
and
RD = 1 − (1 − R5)(1 − R6) = 1 − (1 − 0.7)(1 − 0.9) = 0.97
FIGURE 5.5 Diagrammatic steps of the network reduction approach: (i) Original network;
(ii) reduced network I; (iii) reduced network II; (iv) single hypothetical unit.
where
RC is the subsystem C reliability.
RD is the subsystem D reliability.
Figure 5.5. (iii) depicts the reduced network with the above calculated values. This
resulting network is a two unit series system and its reliability is given by
R S = R C R D = (0.5440) (0.97)
= 0.5277
The single hypothetical unit shown in Figure 5.5 (iv) represents the reliability
of the whole network given in Figure 5.5 (i). More specifically, the entire network
is reduced to a single hypothetical unit. Thus, the total network reliability is 0.5277.
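Because the network reduction approach uses only the series and parallel formulas, Equations (5.19) and (5.27), it is easily automated. The Python sketch below, given as an illustration only, reproduces Example 5.10.

def series(*reliabilities):
    """Equation (5.19): series combination of independent units."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result

def parallel(*reliabilities):
    """Equation (5.27): parallel combination of independent units."""
    product = 1.0
    for r in reliabilities:
        product *= (1.0 - r)
    return 1.0 - product

# Example 5.10: subsystems A and B are series pairs, C and D are parallel groups,
# and the whole network is C in series with D.
r1, r2, r3, r4, r5, r6 = 0.4, 0.5, 0.6, 0.8, 0.7, 0.9
ra, rb = series(r1, r3), series(r2, r4)
rc, rd = parallel(ra, rb), parallel(r5, r6)
print(round(series(rc, rd), 4))    # 0.5277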
5.5.2 DECOMPOSITION APPROACH
This approach is used to determine the reliability of complex systems, which it decomposes into simpler subsystems by applying conditional probability theory. Subsequently, the system reliability is determined by combining the subsystems' reliability measures.
The basis for the approach is the selection of the key unit used to decompose a
given network. The efficiency of the approach depends on the selection of this key
unit. The past experience usually plays an important role in its selection.
The method starts with the assumption that the key unit, say k, is replaced by
another unit that is 100% reliable or never fails and then it assumes the key unit k
is completely removed from the network or system. Thus, the overall system or
network reliability is expressed by

Rn = P(k) P(system good | k good) + P(k̄) P(system good | k failed)    (5.42)

where
Rn is the overall network or system reliability.
P(k) is the probability of success (i.e., the reliability) of the key unit k.
P(k̄) is the failure probability (i.e., the unreliability) of the key unit k.
P(system good | k good) and P(system good | k failed) are the conditional probabilities that the system operates successfully when the key unit k is good and when it has failed, respectively.
Next, we replace this key unit k with a unit that never fails. Consequently, the Figure 5.6 network becomes a series-parallel system whose reliability is expressed by

Rsp = [1 − (1 − R)^2]^2 = (2R − R^2)^2    (5.43)
Similarly, we totally remove the key unit k from Figure 5.6 and the resulting network
becomes a parallel-series system. This parallel-series system reliability is given by
Rps = 1 − (1 − R^2)^2 = 2R^2 − R^4    (5.44)
where
Rps is the parallel-series system reliability.
The reliability and unreliability of the key unit k, respectively, are given by

P(k) = R    (5.45)

and

P(k̄) = (1 − R)    (5.46)
Rewriting Equation (5.42) in terms of our example, we get
Rn = R Rsp + (1 − R) Rps    (5.47)

Substituting Equations (5.43) and (5.44) into Equation (5.47) yields

Rn = R (2R − R^2)^2 + (1 − R)(2R^2 − R^4) = 2R^5 − 5R^4 + 2R^3 + 2R^2    (5.48)
The above equation is for the reliability of the bridge network shown in Figure 5.6.
Also, it is identical to Equation (5.39).
5.5.3 DELTA-STAR METHOD
This is the simplest and a very practical approach to evaluate the reliability of bridge networks. The technique transforms a bridge network into its equivalent series and parallel form. The transformation process introduces a small error in the end result, but for practical purposes it can usually be neglected [2].
Once a bridge network is transformed into its equivalent parallel and series form, the network reduction approach can be applied to obtain the network reliability. Nonetheless, the delta-star method can easily handle networks containing more than one bridge configuration. Furthermore, it can be applied to bridge networks composed of devices having two mutually exclusive failure modes [10, 11].
Figure 5.7 shows delta to star equivalent reliability diagram. The numbers 1, 2,
and 3 denote nodes, the blocks the units, and R() the respective unit reliability.
In Figure 5.7, it is assumed that three units of a system with reliabilities R12,
R13, and R23 form the delta configuration and its star equivalent configuration units
reliabilities are R1, R2, and R3.
Using Equations (5.19) and (5.27) and Figure 5.7, we write down the following
equivalent reliability equations for network reliability between nodes 1, 2; 2, 3; and
1, 3, respectively:
R1 R2 = 1 − (1 − R12)(1 − R13 R23)    (5.49)

R2 R3 = 1 − (1 − R23)(1 − R12 R13)    (5.50)

R1 R3 = 1 − (1 − R13)(1 − R12 R23)    (5.51)

Solving Equations (5.49) through (5.51) yields

R1 = (AC/B)^(1/2)    (5.52)
FIGURE 5.8 A five unit bridge network with specified unit reliabilities.
where

A = 1 − (1 − R12)(1 − R13 R23)    (5.53)

B = 1 − (1 − R23)(1 − R12 R13)    (5.54)

C = 1 − (1 − R13)(1 − R12 R23)    (5.55)

Similarly,

R2 = (AB/C)^(1/2)    (5.56)

R3 = (BC/A)^(1/2)    (5.57)
Example 5.12
A five independent unit bridge network with specified unit reliability Ri; for i = a,
b, c, d, and e is shown in Figure 5.8. Calculate the network reliability by using the
delta-star method and also use the specified data in Equation (5.39) to obtain the
bridge network reliability. Compare both results.
In Figure 5.8 nodes labeled 1, 2, and 3 denote delta configurations. Using
Equations (5.52) through (5.57) and the given data, we get the following star
equivalent reliabilities:
R1 = (AC/B)^(1/2) = 0.9633

where

A = B = C = 1 − (1 − 0.8)[1 − (0.8)(0.8)] = 0.9280

Similarly,

R2 = 0.9633 and R3 = 0.9633
Using the above results, the equivalent network to Figure 5.8 bridge network is
shown in Figure 5.9.
The reliability of the Figure 5.9 network, Rbr, is

Rbr = R3 {1 − (1 − R1 Rd)(1 − R2 Re)} = 0.9633 {1 − [1 − (0.9633)(0.8)][1 − (0.9633)(0.8)]} = 0.9126

Using the specified data directly in Equation (5.39) yields

Rbr = 2(0.8)^5 − 5(0.8)^4 + 2(0.8)^3 + 2(0.8)^2 = 0.9114
Both reliability results are essentially the same, i.e., 0.9126 and 0.9114. All in all, for practical purposes the delta-star approach is quite effective.
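The delta-star transformation of Equations (5.52) through (5.57) can be packaged as a small function, as in the illustrative Python sketch below, which reproduces the Example 5.12 calculation; the function name is an arbitrary label.

import math

def delta_to_star(r12, r13, r23):
    """Equations (5.52) through (5.57): star-equivalent unit reliabilities."""
    a = 1 - (1 - r12) * (1 - r13 * r23)
    b = 1 - (1 - r23) * (1 - r12 * r13)
    c = 1 - (1 - r13) * (1 - r12 * r23)
    r1 = math.sqrt(a * c / b)
    r2 = math.sqrt(a * b / c)
    r3 = math.sqrt(b * c / a)
    return r1, r2, r3

# Example 5.12: all delta units have reliability 0.8; units d and e are 0.8 as well.
r1, r2, r3 = delta_to_star(0.8, 0.8, 0.8)
rd = re = 0.8
r_bridge = r3 * (1 - (1 - r1 * rd) * (1 - r2 * re))
print(round(r1, 4), round(r_bridge, 4))    # 0.9633 and about 0.9126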
5.5.4
This is a very practically inclined method used during bid proposal and early design
phases to estimate equipment failure rate [12]. The information required to use this
method includes generic part types and quantities, part quality levels, and equipment
use environment. Under single use environment, the equipment failure rate can be
estimated by using the following equation [12]:
λE = Σᵢ₌₁ᵐ Qi (λg Fq)i    (5.58)

where
λE is the equipment failure rate, expressed in failures/10^6 h.
m is the number of different generic part/component classifications in the equipment under consideration.
λg is the generic failure rate of generic part i, expressed in failures/10^6 h.
Qi is the quantity of generic part i.
Fq is the quality factor of generic part i.

Reference 12 presents tabulated values for λg and Fq.
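Equation (5.58) amounts to a weighted sum over the parts list. The Python sketch below illustrates the computation; the part quantities, generic failure rates, and quality factors shown are hypothetical placeholders rather than values taken from MIL-HDBK-217 [12].

def parts_count_failure_rate(parts):
    """Equation (5.58): equipment failure rate from generic part data.
    Each entry is (quantity, generic failure rate in failures/1e6 h, quality factor)."""
    return sum(q * lam_g * f_q for q, lam_g, f_q in parts)

# Hypothetical part list; real generic rates and quality factors come from
# the tables referenced in the text.
parts = [
    (24, 0.0029, 1.0),   # resistors
    (10, 0.0041, 1.0),   # ceramic capacitors
    (4,  0.014,  2.0),   # linear integrated circuits
    (2,  0.062,  1.0),   # connectors
]
print(round(parts_count_failure_rate(parts), 4), "failures/1e6 h")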
Failure Rate Estimation of an Electronic Part
As the design matures and more information becomes available, the failure rates of the
equipment components are estimated in greater detail. In the case of electronic parts,
MIL-HDBK-217 [12] is usually used to estimate the part failure rates. These
failure rates are added to obtain the total equipment failure rate. This number provides
a better picture of the actual failure rate of the equipment under consideration than
the one obtained through Equation (5.58).
An equation of the following form is used to estimate failure rates of many
electronic parts [12]:
λp = λb πe πq
(5.59)
where
λp is the part failure rate.
λb is the base failure rate and is normally defined by a model relating the influence of temperature and electrical stresses on the part under consideration.
πe is the factor that accounts for the influence of environment.
πq is the factor that accounts for part quality level.
For many electronic parts, the base failure rate, λb, is calculated by using the
following equation:
λb = C exp (−E/kT)
(5.60)
where
C is a constant.
E is the activation energy for the process.
k is Boltzmann's constant.
T is the absolute temperature.
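As an illustration of Equations (5.59) and (5.60), the following Python sketch computes a part failure rate from an Arrhenius-type base failure rate. All numerical values and names are illustrative placeholders and are not taken from MIL-HDBK-217 tables.

    import math

    BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant, eV/K

    def base_failure_rate(C, E_act, temp_kelvin):
        # Equation (5.60): lambda_b = C * exp(-E / (k T))
        return C * math.exp(-E_act / (BOLTZMANN_EV * temp_kelvin))

    def part_failure_rate(lambda_b, pi_e, pi_q):
        # Equation (5.59): lambda_p = lambda_b * pi_e * pi_q
        return lambda_b * pi_e * pi_q

    # Illustrative values only; real base rates and pi factors come from handbook tables.
    lam_b = base_failure_rate(C=5.0, E_act=0.4, temp_kelvin=358.0)
    print(part_failure_rate(lam_b, pi_e=2.0, pi_q=1.5))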
5.5.5
MARKOV METHOD
This is a powerful reliability analysis tool and it can be used in more cases than any
other method. The method is quite useful for modeling systems with dependent failure
and repair modes. The Markov method is widely used to model repairable systems with
constant failure and repair rates. However, with the exception of a few special cases,
the technique breaks down for systems having time dependent failure and repair
rates. In addition, solving the resulting set of differential equations can become a problem for
large and complex systems. The following assumptions are associated with the
Markov approach [8]:
All occurrences are independent of each other.
The probability of transition from one system state to another in the finite
time interval Δt is given by λΔt, where λ is the transition rate (i.e.,
failure or repair rate) from one system state to another.
The probability of more than one transition occurrence in the finite time interval Δt
from one state to another is very small or negligible (e.g., (λΔt) (λΔt) → 0).
The Markov method is demonstrated by solving the following example.
Example 5.13
Assume that an engineering system can either be in an operating or a failed state.
It fails at a constant failure rate, λ, and is repaired at a constant repair rate, μ. The
system state space diagram is shown in Figure 5.10. The numerals in the box and circle
denote the system states. Obtain expressions for the system time dependent and steady
state availabilities and unavailabilities by using the Markov method.
Using the Markov method, we write down the following equations for state 0
and state 1, respectively, shown in Figure 5.10.
P0 (t + Δt) = P0 (t) (1 − λΔt) + P1 (t) μΔt
(5.61)
P1 (t + Δt) = P1 (t) (1 − μΔt) + P0 (t) λΔt
(5.62)
where
t is time.
λΔt is the probability of system failure in finite time interval Δt.
(1 − λΔt) is the probability of no failure in finite time interval Δt.
μΔt is the probability of system repair in finite time interval Δt.
(1 − μΔt) is the probability of no repair in finite time interval Δt.
P0 (t + Δt) is the probability of the system being in operating state 0 at time (t + Δt).
P1 (t + Δt) is the probability of the system being in failed state 1 at time (t + Δt).
Pi (t) is the probability that the system is in state i at time t, for i = 0, 1.
In the limit as Δt approaches zero, Equations (5.61) and (5.62) become
d P0 (t)/dt + λ P0 (t) = μ P1 (t)
(5.63)
d P1 (t)/dt + μ P1 (t) = λ P0 (t)
(5.64)
Solving Equations (5.63) and (5.64) with the initial conditions P0 (0) = 1 and P1 (0) = 0 yields
P1 (t) = λ/(λ + μ) − [λ/(λ + μ)] e^−(λ + μ) t
(5.65)
P0 (t) = μ/(λ + μ) + [λ/(λ + μ)] e^−(λ + μ) t
(5.66)
Thus, the system time dependent availability and unavailability, respectively, are
A (t) = P0 (t) = μ/(λ + μ) + [λ/(λ + μ)] e^−(λ + μ) t
(5.67)
and
UA (t) = P1 (t) = λ/(λ + μ) − [λ/(λ + μ)] e^−(λ + μ) t
(5.68)
where
A (t) is the system time dependent availability.
UA (t) is the system time dependent unavailability.
The system steady state availability and unavailability can be obtained by using any
of the following three approaches:
Approach I: Letting time t go to infinity in Equations (5.67) and (5.68),
respectively.
Approach II: Setting the derivatives of Equations (5.63) and (5.64)
equal to zero and then discarding any one of the resulting two equations
and replacing it with P0 + P1 = 1. The solutions to the ultimate equations
will be system steady state availability (i.e., A = P0 ) and unavailability
(i.e., UA = P1 ).
Approach III: Taking Laplace transforms of Equations (5.63) and
(5.64) and then solving them for P0(s), the Laplace transform of the probability that the system is in the operating state at time t, and P1(s), the Laplace
transform of the probability that the system is in the failed state at time t. Multiplying P0(s) and P1(s) by the Laplace transform variable s and then
letting s go to zero in sP0(s) and sP1(s) results in the system steady state
availability (i.e., A = P0) and unavailability (i.e., UA = P1), respectively.
Thus, in our case applying Approach I to Equations (5.67) and (5.68), we get
A = lim (t → ∞) A (t) = μ/(λ + μ)
(5.69)
and
UA = lim (t → ∞) UA (t) = λ/(λ + μ)
(5.70)
where
A is the system steady state availability.
UA is the system steady state unavailability.
Since λ = 1/MTTF and μ = 1/MTTR, we can write
A = MTTF/(MTTF + MTTR) = System uptime/(System uptime + System downtime)
(5.71)
where
MTTF is the system mean time to failure.
MTTR is the system mean time to repair.
and
UA = MTTR/(MTTF + MTTR) = System downtime/(System uptime + System downtime)
(5.72)
Thus, the system time dependent and steady state availabilities are given by Equations (5.67) and (5.69) or (5.71), respectively. Similarly, the system time dependent
and steady state unavailabilities are given by Equations (5.68) and (5.70) or (5.72),
respectively.
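A minimal Python sketch of the Example 5.13 results is given below; the failure and repair rates are illustrative values, not data from the text.

    import math

    def availability(t, lam, mu):
        # Equation (5.67): A(t) = mu/(lam + mu) + lam/(lam + mu) * exp(-(lam + mu) t)
        s = lam + mu
        return mu / s + (lam / s) * math.exp(-s * t)

    def steady_state_availability(lam, mu):
        # Equation (5.69); equivalently MTTF/(MTTF + MTTR) in Equation (5.71)
        return mu / (lam + mu)

    lam, mu = 0.0005, 0.02   # illustrative constant failure and repair rates, per hour
    print(availability(100.0, lam, mu))
    print(steady_state_availability(lam, mu))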
5.6.1
HYBRID METHOD
This method is the result of combining two approaches: the similar familiar systems
approach and the factors of influence approach. The resulting method incorporates the
benefits of both; thus, it is more attractive.
The basis for the similar familiar systems reliability allocation approach is the
familiarity of the designer with similar systems or sub-systems. In addition, failure
data collected on similar systems from various sources can also be used during the
allocation process. The main drawback of this approach is the assumption that the reliability
and life cycle cost of previous similar designs were adequate.
The factors of influence method is based upon the following factors that are
considered to affect system reliability:
Complexity/Time: The complexity relates to the number of subsystem
parts and the time to the relative operating time of the item during the
entire system functional period.
Failure criticality. This factor considers the criticality of the item failure
on the system. For example, some auxiliary instrument failure in an
aircraft may not be as crucial as the engine failure.
Environment. This factor takes into consideration the susceptibility or
exposure of items to environmental conditions such as temperature,
humidity, and vibration.
State-of-the-Art. This factor takes into consideration the advancement in
the state-of-the-art for a specific item.
Each item is rated with respect to each of the above influence factors by assigning it
a number from 1 to 10. An assignment of 1 means the item under consideration is least
affected by the factor in question, and 10 means the item is most affected by that
factor. Ultimately, the reliability is allocated using the weights of these assigned
numbers over all influence factors considered.
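One possible way of turning the influence-factor ratings into an allocation is sketched below in Python. The ratings, the system failure rate goal, and the proportional-weighting rule itself are illustrative assumptions; the text does not prescribe a specific formula.

    def influence_weights(ratings):
        """ratings: for each item, the 1-to-10 scores for the influence factors
        (complexity/time, failure criticality, environment, state-of-the-art)."""
        totals = [sum(r) for r in ratings]
        grand = sum(totals)
        return [t / grand for t in totals]

    # Hypothetical three-item subsystem; under this assumed rule, a higher weight
    # (item more affected by the factors) receives a larger share of the allowed
    # system failure rate, i.e., a less demanding reliability goal.
    ratings = [(8, 9, 6, 4), (3, 5, 2, 2), (5, 7, 4, 3)]
    system_failure_rate_goal = 0.0008   # failures/h, illustrative
    for w in influence_weights(ratings):
        print(round(w * system_failure_rate_goal, 6))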
Now, it should be obvious to the reader that the hybrid method is better than
similar familiar systems and factors of influence methods because it uses data from
both of these approaches.
5.6.2
This method is concerned with allocating failure rates to system components when
the system required failure rate is known. The following assumptions are associated
with this method:
System components form a series configuration.
System components fail independently.
Time to component failure is exponentially distributed.
Thus, the system failure rate using Equation (5.23) is
λS = Σ_{i=1}^{n} λi
(5.73)
where
n is the number of components in the system.
λS is the system failure rate.
λi is the failure rate of system component i; for i = 1, 2, 3, …, n.
If the system required failure rate is λsr, then the component failure rates are to be allocated such that
Σ_{i=1}^{n} λi* ≤ λsr
(5.74)
where
λi* is the failure rate allocated to component i; for i = 1, 2, 3, …, n.
The following steps are associated with this method:
1. Estimate the component failure rates λi, for i = 1, 2, 3, …, n, using past data.
2. Calculate the relative weight, θi, of component i using the preceding step
failure rate data and the following equation:
θi = λi / Σ_{j=1}^{n} λj, for i = 1, 2, …, n
(5.75)
Note that the relative weights satisfy
Σ_{i=1}^{n} θi = 1
(5.76)
3. Allocate the failure rate to component i by using
λi* = θi λsr, for i = 1, 2, …, n
(5.77)
It must be remembered that Equation (5.77) is subject to the condition that the
equality holds in Equation (5.74).
Example 5.14
Assume that a military system is composed of five independent subsystems in series
and its specified failure rate is 0.0006 failures/h. The estimated failure rates from
past experience for subsystems 1, 2, 3, 4, and 5 are 1 = 0.0001 failures/h, 2 =
0.0002 failures/h, 3 = 0.0003 failures/h, 4 = 0.0004 failures/h, and 5 = 0.0005
failures/h, respectively. Allocate the specified system failure rate to five subsystems.
Using Equation (5.73) and the given data, we get the following estimated military
system failure rate:
λs = Σ_{i=1}^{5} λi = 0.0001 + 0.0002 + 0.0003 + 0.0004 + 0.0005 = 0.0015 failures/h
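The remaining steps of Example 5.14 follow directly from Equations (5.75) through (5.77). The following Python sketch, which is not part of the original text, carries out the allocation with the given data.

    def allocate_failure_rates(estimated, required):
        # Equations (5.75) through (5.77): weight each component by its estimated
        # failure rate, then distribute the required system failure rate.
        total = sum(estimated)
        weights = [lam / total for lam in estimated]
        return [theta * required for theta in weights]

    estimated = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]   # failures/h (Example 5.14)
    allocated = allocate_failure_rates(estimated, required=0.0006)
    print(allocated)       # approximately 0.00004, 0.00008, 0.00012, 0.00016, 0.0002 failures/h
    print(sum(allocated))  # 0.0006 failures/h, the specified system failure rate

Each subsystem therefore receives a share of the 0.0006 failures/h goal proportional to its estimated failure rate.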
5.7 PROBLEMS
1. Describe the bathtub hazard rate curve and the reasons for its useful life
region failures.
2. Prove that the item mean time to failure (MTTF) is given by
MTTF = lim (s → 0) R(s)
(5.78)
where
s is the Laplace transform variable.
R(s) is the Laplace transform of the item reliability.
3. Prove that the mean time to failure of a system is given by
MTTF = 1/(λ1 + λ2 + … + λn)
(5.79)
where
n is the number of units.
λi is the failure rate of unit i; for i = 1, 2, 3, …, n.
Write down your assumptions.
4. A parallel system has n independent and identical units. Obtain an expression for the system hazard rate, if each unit's times to failure are exponentially distributed. Compare the end result with the one for the series
system under the same condition.
5. Prove that when an item's times to failure are exponentially distributed,
its failure rate is constant.
6. A system has four independent and identical units in parallel. Each unit's
failure rate is 0.0005 failures/h. Calculate the system reliability, if at least
two units must operate normally for the system success during a 100-h
mission.
7. Compare the mean time to failure of k independent and identical unit series
and standby systems when unit times to failure are exponentially distributed.
8. Calculate the reliability of the Figure 5.11 network using the delta-star
approach. Assume that each block in the figure denotes a unit with reliability 0.8 and all units fail independently.
9. For the system whose transition diagram is shown in Figure 5.10, obtain
steady state probability equations by applying the final value theorem to
Laplace transforms (i.e., Approach III of Example 5.13).
10. An aerospace system is made up of seven independent subsystems in series
and its specified failure rate is 0.009 failures/h. The estimated failure rates of subsystems 1, 2, 3, 4, 5,
6, and 7 from previous experience are 0.001 failures/h,
0.002 failures/h, 0.003 failures/h, 0.004 failures/h, 0.005 failures/h,
0.006 failures/h, and 0.007 failures/h, respectively. Allocate the specified
system failure rate to the seven subsystems.
5.8 REFERENCES
1. Grant Ireson, W., Coombs, C.F., and Moss, R.Y., Eds., Handbook of Reliability
Engineering and Management, McGraw-Hill, New York, 1996.
2. Dhillon, B.S., Reliability in Systems Design and Operation, Van Nostrand Reinhold
Company, New York, 1983.
3. AMCP 706-196, Engineering Design Handbook: Development Guide for Reliability,
Part II: Design for Reliability, U.S. Army Material Command (AMC), Washington,
D.C., 1976.
4. Reliability Design Handbook, RDG-376, Reliability Analysis Center, Rome Air Development Center, Griffiss Air Force Base, Rome, New York, 1976.
5. Dhillon, B.S., A hazard rate model, IEEE Trans. Reliability, 28, 150, 1979.
6. Dhillon, B.S., Mechanical Reliability: Theory, Models, and Applications, American
Institute of Aeronautics and Astronautics, Washington, D.C., 1988.
7. Lipp, J.P., Topology of switching elements vs. reliability, Trans. IRE Reliability
Quality Control, 7, 21-34, 1957.
8. Shooman, M.L., Probabilistic Reliability: An Engineering Approach, McGraw-Hill,
New York, 1968.
9. Dhillon, B.S. and Singh, C., Engineering Reliability: New Techniques and Applications, John Wiley & Sons, New York, 1981.
10. Dhillon, B.S., The Analysis of the Reliability of Multistate Device Networks, Ph.D.,
Dissertation, 1975. Available from the National Library of Canada, Ottawa.
11. Dhillon, B.S. and Proctor, C.L. Reliability analysis of multistate device networks,
Proc. Annu. Reliability Maintainability Symp., 31-35, 1976.
12. MIL-HDBK-217, Reliability Prediction of Electronic Equipment, Department of
Defense, Washington, D.C.
13. Dhillon, B.S., Systems Reliability, Maintainability and Management, Petrocelli
Books, New York, 1983.
Failure Modes
and Effect Analysis
6.1 INTRODUCTION
Failure modes and effect analysis (FMEA) is a powerful design tool to analyze
engineering systems and it may simply be described as an approach to perform
analysis of each potential failure mode in the system to examine the results or effects
of such failure modes on the system [1]. When FMEA is extended to classify each
potential failure effect according to its severity (this incorporates documenting catastrophic and critical failures), the method is known as failure mode effects and
criticality analysis (FMECA).
The history of FMEA goes back to the early 1950s with the development of
flight control systems, when the U.S. Navy's Bureau of Aeronautics, in order to
establish a mechanism for reliability control over the detail design effort, developed
a requirement called Failure Analysis [2]. Subsequently, Coutinho [3] coined the
term Failure Effect Analysis and the Bureau of Naval Weapons (i.e., the successor to
the Bureau of Aeronautics) introduced it into its new specification on flight controls.
However, FMECA was developed by the National Aeronautics and Space
Administration (NASA) to assure the desired reliability of space systems [4].
In the 1970s, the U.S. Department of Defense directed its effort to develop a
military standard entitled Procedures for Performing a Failure Mode, Effects, and
Criticality Analysis [5]. Today FMEA/FMECA methods are widely used in the
industry to conduct analysis of systems, particularly for use in aerospace, defense,
and nuclear power generation. Reference 6 presents a comprehensive list of publications on FMEA/FMECA. This chapter discusses different aspects of FMEA/FMECA.
6.3.1
DESIGN-LEVEL FMEA
The purpose of performing design-level FMEA is to help identify and stop product
failures related to design. This type of FMEA can be carried out on a component-level, subsystem-level, or system-level design proposal, and its intention is to validate the proposed design.
6.3.2
SYSTEM-LEVEL FMEA
This is the highest-level FMEA that can be performed and its purpose is to identify
and prevent failures related to system/subsystems during the early conceptual design.
Furthermore, this type of FMEA is carried out to validate that the system design
specifications reduce the risk of functional failure to the lowest level during the
operational period.
Some benefits of the system-level FMEA are as follows:
Identification of potential systemic failure modes caused by the system's interaction with other systems and/or by subsystem interactions.
Selection of the optimum system design alternative.
Identification of potential system design parameters that may incorporate deficiencies prior to releasing hardware/software to production.
A systematic approach to identify all potential effects of subsystem/assembly part failure modes for incorporation into the design-level FMEA.
A useful data bank of historical records of the thought processes as well as of the actions taken during product development efforts.
6.3.3
PROCESS-LEVEL FMEA
6.4.1
DEFINE SYSTEM AND ITS ASSOCIATED REQUIREMENTS
This is concerned with defining the system under consideration and the definition
normally incorporates a breakdown of the system into blocks, block functions, and
the interface between them. Usually this early in the program a good system definition does not exist and the analyst must develop his/her own system definition
using documents such as trade study reports, drawings, and development plans and
specifications.
6.4.2
The ground rules according to which the FMEA is conducted are established next. Usually, developing the
ground rules is a quite straightforward process when the system definition and
mission requirements are reasonably complete. Nonetheless, examples of the ground
rules include primary and secondary mission objectives statement, limits of environmental and operational stresses, statement of analysis level, delineation of mission phases, definition of what constitutes failure of system hardware parts, and the
detail of the coding system used.
6.4.3 DESCRIBE SYSTEM AND ITS ASSOCIATED FUNCTIONAL BLOCKS
This is concerned with the preparation of the description of the system under
consideration. Such description may be grouped into two parts:
Narrative functional statement. This is prepared for each subsystem and
component as well as for the total system. It provides a narrative description
of each item's operation for each operational mode/mission phase. The
degree of description detail depends on factors such as an item's
application and the uniqueness of the functions performed.
System block diagram. The purpose of this block diagram is to determine
the success/failure relationships among all the system components; thus,
it graphically shows the total system components to be analyzed as well as
the series and redundant relationships among the components. In addition,
the block diagram shows the entire system's inputs and outputs and each
system element's inputs and outputs.
6.4.4 PERFORM FAILURE MODE AND EFFECT ANALYSIS
This is concerned with performing analysis of the failure modes and their effects.
A form such as that shown in Figure 6.3 is used as a worksheet to assure systematic
and thorough coverage of all failure modes. Even though all the terms used in that
form are considered self-explanatory, the terms "Compensating provisions" and
"Criticality classification" are described below.
Compensating provisions. These provisions, i.e., design provisions or
operator actions, concerning circumventing or mitigating the failure effect
should be identified and evaluated [5].
Criticality classification. This is concerned with the categorization of
potential effect of failure. For example [4],
People may lose their lives due to a failure.
Failure may cause mission loss.
Failure may cause delay in activation.
Failure has no effect.
6.4.5
6.4.6
DOCUMENT THE ANALYSIS
This is the final step of the FMEA performing process and is concerned with the
documentation of analysis. This step is at least as important as the other previous
five steps because poor documentation can lead to ineffectiveness of the FMEA
process. Nonetheless, the FMEA report incorporates items such as system definition,
FMEA associated ground rules, failure modes and their effects, description of the
system (i.e., including functional descriptions and block diagrams), and critical items
list.
6.5.1
RPN TECHNIQUE
This method calculates the risk priority number for a part failure mode using three
factors: (1) failure effect severity, (2) failure mode occurrence probability, and
TABLE 6.1
Failure Detection Ranking

Item No.   Likelihood of detection               Rank
1          Very high                             1, 2
2          High                                  3, 4
3          Moderate                              5, 6
4          Low                                   7, 8
5          Very low                              9
6          Detectability absolutely uncertain    10
(3) failure detection probability. More specifically, the risk priority number is computed by multiplying the rankings (i.e., 1 to 10) assigned to each of these three factors.
Thus, mathematically the risk priority number is expressed by
RPN = (DR) (OR) (SR)
(6.1)
where
RPN is the risk priority number.
DR is the detection ranking.
OR is the occurrence ranking.
SR is the severity ranking.
Since the above three factors are assigned rankings from 1 to 10, the value of the
RPN will vary from 1 to 1000. Failure modes with a high RPN are considered to
be more critical; thus, they are given a higher priority in comparison to the ones
with lower RPN. Nonetheless, rankings and their interpretation may vary from one
organization to another. Tables 6.1 through 6.3 present rankings for failure detection,
failure mode occurrence probability, and failure effect severity used in one
organization [12], respectively.
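A small Python sketch of the RPN calculation of Equation (6.1) is given below; the failure modes and rankings shown are hypothetical and only illustrate how modes would be prioritized.

    def risk_priority_number(detection, occurrence, severity):
        # Equation (6.1): RPN = (DR)(OR)(SR); each ranking is an integer from 1 to 10
        for rank in (detection, occurrence, severity):
            if not 1 <= rank <= 10:
                raise ValueError("rankings must lie between 1 and 10")
        return detection * occurrence * severity

    # Hypothetical failure modes: (name, DR, OR, SR)
    modes = [("seal leak", 4, 6, 7), ("sensor drift", 8, 3, 5), ("connector open", 2, 2, 9)]
    ranked = sorted(modes, key=lambda m: risk_priority_number(*m[1:]), reverse=True)
    for name, dr, orr, sr in ranked:
        print(name, risk_priority_number(dr, orr, sr))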
6.5.2
This technique is often used in defense, aerospace, and nuclear industries to prioritize
the failure modes of the item under consideration so that appropriate corrective
measures can be undertaken [5]. The technique requires the categorization of the
failure-mode effect severity and then the development of a criticality ranking.
Table 6.4 presents classifications of failure mode effect severity.
In order to assess the likelihood of a failure-mode occurrence, either a qualitative
or a quantitative approach can be used. This is determined by the availability of
specific parts configuration data and failure rate data. The qualitative method is used
TABLE 6.2
Failure Mode Occurrence Probability

Item No.   Ranking term   Occurrence probability   Rank
1          Remote         <1 in 10⁶                1
2          Low            1 in 20,000              2
                          1 in 4,000               3
3          Moderate       1 in 1,000               4
                          1 in 400                 5
                          1 in 80                  6
4          High           1 in 40                  7
                          1 in 20                  8
5          Very high      1 in 8                   9
                          1 in 2                   10
TABLE 6.3
Severity of the Failure-Mode Effect

Item No.   Failure effect severity category   Rank
1          Minor                              1
2          Low                                2, 3
3          Moderate                           4, 5, 6
4          High                               7, 8
5          Very high                          9, 10
when there are no specific failure rate data. Thus, in this approach the individual
occurrence probabilities are grouped into distinct logically defined levels, which
establish the qualitative failure probabilities. Table 6.5 presents these occurrence probability levels [5, 9]. For the purpose of identifying and comparing each failure mode
to all other failure modes with respect to severity, a criticality matrix is developed as
shown in Figure 6.3.
The criticality matrix is developed by inserting item/failure mode identification
number values in matrix locations denoting the severity category classification and
either the criticality number (Ki) for the failure modes of an item or the occurrence
level probability. The distribution of criticality of item failure modes is depicted by
the resulting matrix and the matrix serves as a useful tool for assigning corrective
measure priorities. The direction of the arrow, originating from the origin, shown in
TABLE 6.4
Failure Mode Effect Severity Classifications

Item No.   Classification term (Classification No.)   Classification description
1          Catastrophic (A)   The occurrence of failure may result in death or equipment loss.
2          Critical (B)       The occurrence of failure may result in major property damage/severe injury/major system damage ultimately leading to mission loss.
3          Marginal (C)       The occurrence of failure may result in minor injury/minor property damage/minor system damage, etc.
4          Minor (D)          The failure is not serious enough to lead to injury/system damage/property damage, but it will result in repair or unscheduled maintenance.
TABLE 6.5
Qualitative Failure Probability Levels

Item No.   Probability level           Term description
1          I (Frequent)                High probability of occurrence during the item operational period
2          II (Reasonably probable)    Moderate probability of occurrence during the item operational period
3          III (Occasional)            Occasional probability of occurrence during the item operational period
4          IV (Remote)                 Unlikely probability of occurrence during the item operational period
5          V (Extremely unlikely)      Essentially zero chance of occurrence during the item operational period
Figure 6.5, indicates the increasing criticality of the item failure and the darkened
region in the figure shows the approximate desirable design region. For severity
classifications A and B, the desirable design region has low occurrence probability
or criticality number. On the other hand, for severity classifications C and D failures,
higher probabilities of occurrence can be tolerated. Nonetheless, failure modes
belonging to classifications A and B should be eliminated altogether or at least their
probabilities of occurrence be reduced to an acceptable level through design changes.
The quantitative approach is used when failure mode and probability of occurrence data are available. Thus, the failure mode criticality number is calculated using
the following equation:
Kfm = α F λ T
(6.2)
where
Kfm is the failure-mode criticality number.
α is the failure-mode ratio or the probability that the component/part fails in the particular failure mode of interest. More specifically, it is the fraction of the part failure rate that can be allocated to the failure mode under consideration; when all failure modes of an item under consideration are specified, the sum of the allocations equals unity. Table 6.6 presents failure mode apportionments for certain parts [10, 11, 13].
F is the conditional probability that the failure effect results in the indicated severity classification or category, given that the failure mode occurs. The values of F are based on an analyst's judgment and these values are quantified according to the Table 6.7 guidelines.
T is the item operational time expressed in hours or cycles.
λ is the item/part failure rate.
The item criticality number Ki is calculated for each severity class separately.
Thus, the total of the criticality numbers of all failure modes of an item resulting in
failures in the severity class of interest is given by
Ki = Σ_{j=1}^{n} (Kfm)j = Σ_{j=1}^{n} (α F λ T)j
(6.3)
where
n is the number of item failure modes that fall under the severity classification under consideration.
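The following Python sketch, assuming the notation of Equations (6.2) and (6.3), shows how the failure-mode and item criticality numbers could be computed; the item, its failure modes, and all numerical values are hypothetical.

    def mode_criticality(alpha, F, lam, T):
        # Equation (6.2): K_fm = alpha * F * lambda * T
        return alpha * F * lam * T

    def item_criticality(modes):
        # Equation (6.3): K_i is the sum of K_fm over all failure modes of the item
        # that fall under the severity classification of interest.
        return sum(mode_criticality(a, F, lam, T) for (a, F, lam, T) in modes)

    # Hypothetical item with two failure modes in the same severity class:
    # (failure-mode ratio, failure-effect probability, part failure rate per 10^6 h, operating hours)
    modes = [(0.75, 1.0, 0.5, 1000.0), (0.25, 0.1, 0.5, 1000.0)]
    print(item_criticality(modes))   # 387.5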
TABLE 6.6
Failure Mode Apportionments for Selective Items

Item no.   Item description    Failure mode                                       Approximate probability of occurrence (or probability value for α)
1          Relay               Contact failure                                    0.75
                               Open coils                                         0.05
2          Relief valve        Premature open                                     0.77
                               Leaking                                            0.23
3          Incandescent lamp   Catastrophic (filament breakage, glass breakage)   0.10
                               Degradation (loss of filament emission)            0.90
4          Fixed resistor      Open                                               0.84
                               Parameter change                                   0.11
                               Short                                              0.05
5          Transformer         Shorted turns                                      0.80
                               Open circuit                                       0.05
6          Diode               Short circuit                                      0.75
                               Intermittent circuit                               0.18
                               Open circuit                                       0.06
                               Other                                              0.01
7          Hydraulic valve     Stuck closed                                       0.12
                               Stuck open                                         0.11
                               Leaking                                            0.77
TABLE 6.7
Failure Effect Probability Guideline Values

Item no.   Failure effect description   Probability value of F
1          No effect                    0
2          Actual loss                  1.00
3          Probable loss                0.10 < F < 1.00
4          Possible loss                0 < F ≤ 0.10
It is to be noted that when an item/part failure mode results in multiple severity-class effects, each with its own occurrence probability, then in the calculation of Ki
only the most important is used [14]. This can lead to erroneously low Ki values for
the less critical severity categories. In order to rectify this error, it is recommended
to compute F values for all severity categories associated with a failure mode and
ultimately include only the contributions to Ki for category B, C, and D failures [9, 14].
Typical sources for the data elements of the FMEA/FMECA worksheet are indicated in parentheses below:
Failure modes, causes, and rates (factory database, field experience database).
Failure effects (design engineer, reliability engineer, safety engineer).
Item identification numbers (parts list).
Failure detection method (design engineer, maintainability engineer).
Function (customer requirements, design engineer).
Failure probability/severity classification (safety engineer).
Item nomenclature/functional specifications (parts list, design engineer).
Mission phase/operational mode (design engineer).
Usually, FMECA is expected to satisfy the needs of many groups during the
design process including design, quality assurance, reliability and maintainability,
customer representatives, internal company regulatory agency, system engineering,
testing, logistics support, system safety, and manufacturing.
Furthermore, applying the Pareto principle to the RPN is a misapplication of the
Pareto principle [15].
6.9 PROBLEMS
1. Write a short essay on the history of FMEA.
2. Define the following terms:
Failure mode
Single point failure
Severity
3. Discuss design-level FMEA and its associated benefits.
4. Describe the FMEA process.
5. Discuss the elements of a typical critical items list worksheet.
6. Describe the RPN method.
7. Identify sources for obtaining data for the following factors:
Item identification numbers
Failure effects
Failure probability/severity classification
8. List at least 10 advantages of performing FMEA.
9. What is the difference between FMEA and FMECA?
10. Who are the users of FMEA/FMECA?
6.10
REFERENCES
7.1 INTRODUCTION
Fault tree analysis (FTA) is widely performed in the industry to evaluate engineering
systems during their design and development, in particular the ones used in nuclear
power generation. A fault tree may simply be described as a logical representation
of the relationship of primary events that lead to a specified undesirable event called
the top event and is depicted using a tree structure with OR, AND, etc. logic gates.
The fault tree method was developed in the early 1960s by H.A. Watson of Bell
Telephone Laboratories to perform analysis of the Minuteman Launch Control
System. A study team at the Bell Telephone Laboratories further refined it and
Haasl [1] of the Boeing company played a pivotal role in its subsequent development.
In 1965 several papers related to the technique were presented at the System
Safety Symposium held at the University of Washington, Seattle [1]. In 1974, a
conference on Reliability and Fault Tree Analysis was held at the University of
California, Berkeley [2]. A paper appeared in 1978 providing a comprehensive list
of publications on Fault Trees [3]. Three books that described FTA in considerable depth appeared in 1981 [4-6]. Needless to say, since the inception of the fault
tree technique, many people have contributed to its additional developments.
This chapter describes different aspects of FTA in considerable depth.
FIGURE 7.1 Commonly used fault tree symbols: (i) rectangle, (ii) circle, (iii) diamond,
(iv) triangle A, (v) triangle B, (vi) AND gate, (vii) OR gate.
Many other symbols are also used from time to time in performing FTA. Most
of these symbols are discussed in References 6 through 8.
FIGURE 7.2 Fault tree for the top event: dark room or room without light.
Idempotent law:
A · A = A
(7.1)
A + A = A
(7.2)
Distributive law:
X (Y + Z) = XY + XZ
(7.3)
X + YZ = (X + Y) (X + Z)
(7.4)
Commutative law:
AB = BA
(7.5)
A + B = B + A
(7.6)
Absorption law:
X + XY = X
(7.7)
X (X + Y) = X
(7.8)
OR GATE
An OR gate with m input fault events A1, A2, A3, …, Am and its output fault event
A0 is shown in Figure 7.3. Mathematically, the output
fault event A0 of the m input fault event OR gate is expressed by
A0 = A1 + A2 + A3 + … + Am
(7.9)
where
Ai is the ith input fault event; for i = 1, 2, 3, …, m.
AND GATE
An AND gate with k input fault events X1, X2, X3, …, Xk and its output fault event
X0 is shown in Figure 7.4. Mathematically, the output
fault event X0 of the k input fault event AND gate is expressed by
X0 = X1 X2 X3 … Xk
(7.10)
where
Xi is the ith input fault event; for i = 1, 2, 3, …, k.
Example 7.2
For the fault tree shown in Figure 7.2, develop a Boolean expression for the top
event: Room without light (i.e., fault event B10) using fault event identification
symbols B1, B2, B3, …, B9.
Using relationship (7.10), the Boolean expression for the intermediate fault event
B8 is
B8 = B1 B2 B3 B4
(7.11)
Similarly, utilizing relationship (7.9), the Boolean equation for the intermediate fault
event B9 is given by
B9 = B5 + B6
(7.12)
Again using relationship (7.9), the Boolean expression for the top fault event B10 is
expressed by
B10 = B7 + B8 + B9
7.6.1
(7.13)
A hypothetical fault tree with repeated fault events is shown in Figure 7.5 (i.e., the
basic fault event A is repeated). Thus, in this case, the repetition of A must be
eliminated prior to obtaining the quantitative reliability parameter results for the
fault tree. Otherwise, the quantitative values will be incorrect. The elimination of
repeated events can either be achieved by applying the Boolean algebra properties
such as presented in Section 7.5 or algorithms especially developed for this
purpose [6, 10-12].
One such algorithm is represented in Section 7.6.2.
Example 7.3
Use Boolean algebra properties to eliminate the repetition of the occurrence of event
A shown in Figure 7.5. Construct the repeated event free fault tree using the simplified Boolean expression for the top event.
Using Figure 7.5, we write down the following Boolean expressions:
X=A+B
(7.14)
Y = A+C
(7.15)
Z = XY
(7.16)
T = Z+D+E
(7.17)
Substituting Equations (7.14) through (7.16) into Equation (7.17) yields
T = (A + B) (A + C) + D + E
(7.18)
= A A + A C + B A + B C + D + E
(7.19)
= A + BC + D + E
(7.20)
The above equation represents the simplified Boolean expression for the top event
of the fault tree shown in Figure 7.5. The repeated event free fault tree constructed
using Equation (7.20) is shown in Figure 7.6.
FIGURE 7.6 The repeated event free fault tree developed using Equation (7.20).
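The Boolean reduction of Example 7.3 can be checked mechanically. The following Python sketch assumes the sympy library is available; it is an illustration, not part of the original text.

    from sympy import symbols
    from sympy.logic.boolalg import simplify_logic

    A, B, C, D, E = symbols("A B C D E")

    # Top event of Figure 7.5: T = Z + D + E, with Z = XY, X = A + B, Y = A + C
    T = ((A | B) & (A | C)) | D | E

    print(simplify_logic(T))   # expected: A | D | E | (B & C), i.e., Equation (7.20)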
7.6.2 ALGORITHM FOR OBTAINING MINIMAL CUT SETS
One of the difficult problems facing the fault tree method is to obtain minimal cut
sets or eliminate repeated events of a fault tree. This section presents one algorithm
to obtain minimal cut sets of a fault tree [6, 1012].
A cut set may be described as a collection of basic events that will cause the
top fault tree event to occur. Furthermore, a cut set is said to be minimal if it cannot
be further minimized or reduced but it can still ensure the occurrence of the top
fault tree event.
The algorithm under consideration can either be used manually for simple fault
trees or computerized for complex fault trees with hundreds of gates and basic fault
events. In this algorithm, the AND gate increases the size of a cut set and the OR
gate increases the number of cut sets. The algorithm is demonstrated by solving the
following example:
Example 7.4
Obtain a repeated event free fault tree of the fault tree shown in Figure 7.7. In this
figure, the basic fault events are identified by the numerals and the logic gates are
labeled as GT0, GT1, GT2, …, GT6.
The algorithm starts from the gate, GT0, just below the top event of the fault
tree shown in Figure 7.7. Usually, in a fault tree this gate is either OR or AND. If
this gate is an OR gate, then each of its inputs represents an entry for each row of
the list matrix. On the other hand, if this gate is an AND, then each of its inputs
denotes an entry for each column of the list matrix.
In our case, GT0 is an OR gate; thus, we start the formulation of the list matrix
by listing the gate inputs: 11, 12, GT1, and GT2 (output events) in a single column
but in separate rows as shown in Figure 7.8 (i). As any one input of the GT0 could
cause the occurrence of the top event, these inputs are the members of distinct cut
sets.
One simple rule associated with this algorithm is to replace each gate by its
inputs until all the gates in a given fault tree are replaced with the basic event entries.
The inputs of a gate could be the basic events or the outputs of other gates. Consequently, in our case, in order to obtain a fully developed list matrix, we proceed to
replace the OR gate GT1 of list matrix of Figure 7.8 (i) by its input events as separate
rows, as indicated by the dotted line in Figure 7.8 (ii).
Next, we replace the OR gate GT2 of the list matrix in Figure 7.8 (ii) by its
input events as indicated by the dotted line in Figure 7.8 (iii).
Similarly, we replace the OR gate GT3 of the list matrix in Figure 7.8 (iii) by
its input events 5 and 6 as identified by the dotted line in Figure 7.8 (iv).
The next gate GT4 to be replaced with its input events is an AND gate. The
dotted line in Figure 7.8 (v) indicates its input events. Again, the next gate, GT5, to
be replaced by its input events is an AND gate.
Its input events are indicated by the dotted line in Figure 7.8 (vi).
The last gate to be replaced by its input events is an OR gate GT6. Its inputs
are indicated by the dotted line in the list matrix of Figure 7.8 (vii). As shown in
the list matrix of Figure 7.8 (vii), the cut set 2 is a single event cut set. As only its
occurrence will result in the occurrence of the top event, we eliminate cut set {3, 4, 2}
from the list matrix because the occurrence of this cut set requires all events 3, 4, and 2
to occur. Consequently, the list matrix shown in Figure 7.8 (viii) represents the
minimal cut sets of the fault tree given in Figure 7.7.
The fault tree of the Figure 7.8 (viii) list matrix is shown in Figure 7.9. Now this
fault tree can be used to obtain the quantitative measures of the top or undesirable event.
FIGURE 7.9 A fault tree for the minimal cut sets of Figure 7.8 (viii).
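The list matrix procedure of Section 7.6.2 can be sketched in code. The following Python function is a simplified, illustrative implementation of the top-down expansion described above (the gate and event names are hypothetical), not the exact algorithm of References 6 and 10 through 12.

    def minimal_cut_sets(gates, top):
        """Top-down expansion of a fault tree into minimal cut sets.
        gates maps a gate name to ("OR" | "AND", [inputs]); other names are basic events."""
        paths = [[top]]
        while any(name in gates for path in paths for name in path):
            new_paths = []
            for path in paths:
                gate = next((n for n in path if n in gates), None)
                if gate is None:
                    new_paths.append(path)
                    continue
                kind, inputs = gates[gate]
                rest = [n for n in path if n != gate]
                if kind == "AND":                      # an AND gate enlarges the cut set
                    new_paths.append(rest + inputs)
                else:                                  # an OR gate increases the number of cut sets
                    new_paths.extend(rest + [inp] for inp in inputs)
            paths = new_paths
        cut_sets = {frozenset(p) for p in paths}
        # Keep only minimal cut sets (discard supersets, as was done with {3, 4, 2} above)
        return [s for s in cut_sets if not any(o < s for o in cut_sets)]

    # Hypothetical small tree: TOP = OR(1, G1), G1 = AND(2, G2), G2 = OR(3, 2)
    gates = {"TOP": ("OR", ["1", "G1"]), "G1": ("AND", ["2", "G2"]), "G2": ("OR", ["3", "2"])}
    print(minimal_cut_sets(gates, "TOP"))   # the minimal cut sets {1} and {2}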
OR GATE
Using Figure 7.3 for independent input events, the probability of occurrence of the
output fault event A0 is given by
P (A0) = 1 − Π_{i=1}^{m} {1 − P (Ai)}
(7.21)
where
P (Ai) is the probability of occurrence of input fault event Ai; for i = 1, 2, 3, …, m.
For m = 2, Equation (7.21) reduces to
P (A0) = 1 − Π_{i=1}^{2} {1 − P (Ai)}
(7.22)
For probabilities of occurrence of fault events A1 and A2 less than 0.1,
Equation (7.22) may be approximated by
P (A0) ≈ P (A1) + P (A2)
(7.23)
Similarly, for m small input fault event probabilities, Equation (7.21) may be approximated by
P (A0) ≈ Σ_{i=1}^{m} P (Ai)
(7.24)
AND GATE
Using Figure 7.4, for independent input fault events, the probability of occurrence
of the output fault event X0 is
P (X0) = Π_{i=1}^{k} P (Xi)
(7.25)
where
P (Xi) is the probability of occurrence of input fault event Xi; for i = 1, 2, 3, …, k.
7.7.1
In this case, the basic fault events, say, representing component failures of a system,
are not repaired but their probabilities of occurrence are known. Under such conditions, the probability evaluation of fault trees is demonstrated through Example 7.4.
Example 7.4
Assume that in Figure 7.2, the probability of occurrence of basic events B1, B2, B3,
B4, B5, B6, and B7 are 0.15, 0.15, 0.15, 0.15, 0.06, 0.08, and 0.04, respectively.
Calculate the probability of occurrence of the top event: room without light.
By inserting the given data into Equation (7.25), the probability of occurrence
of event B8 is
P (B8) = P (B1) P (B2) P (B3) P (B4)
= (0.15) (0.15) (0.15) (0.15)
≈ 0.0005
Using the given data with Equation (7.22), we get the following probability of
occurrence of event B9:
P (B9) = 1 − [1 − P (B5)] [1 − P (B6)] = 1 − (1 − 0.06) (1 − 0.08) = 0.1352
Finally, substituting the above results and the given data into Equation (7.21), the probability of occurrence of the top event B10, room without light, is
P (B10) = 1 − [1 − P (B7)] [1 − P (B8)] [1 − P (B9)] = 1 − (1 − 0.04) (1 − 0.0005) (1 − 0.1352) ≈ 0.1702
The calculated fault event probability values are shown in the Figure 7.10 fault tree.
FIGURE 7.10 Calculated fault event probability values of Figure 7.2 fault tree.
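The gate formulas of Equations (7.21) and (7.25) and the above example can be verified with a short Python sketch (illustrative, not part of the original text):

    def or_gate(probs):
        # Equation (7.21): P = 1 - product(1 - P_i), independent input events
        p = 1.0
        for x in probs:
            p *= (1.0 - x)
        return 1.0 - p

    def and_gate(probs):
        # Equation (7.25): P = product(P_i), independent input events
        p = 1.0
        for x in probs:
            p *= x
        return p

    # Basic-event probabilities from the example (B1 through B7)
    B1 = B2 = B3 = B4 = 0.15
    B5, B6, B7 = 0.06, 0.08, 0.04
    B8 = and_gate([B1, B2, B3, B4])     # about 0.0005
    B9 = or_gate([B5, B6])              # about 0.1352
    print(round(or_gate([B7, B8, B9]), 4))   # top event, about 0.1702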
7.7.2
In this case, the basic fault events, say, representing component failures of a system,
are repaired and the failure and repair rates of the components are known. Thus,
using the Markov method, the unavailability of a component is given by
UA (t) = [λ/(λ + μ)] [1 − e^−(λ + μ) t]
(7.26)
where
UA (t) is the component unavailability at time t, λ is the component constant failure rate, and μ is the component constant repair rate.
For large t, Equation (7.26) reduces to
UA = λ/(λ + μ)
(7.27)
where
UA is the component steady state unavailability.
The above equation yields steady state probability of a component (that it will
be unavailable for service) when its failure and repair rates are known. Thus,
substituting Equation (7.27) into Equation (7.21), we get the steady state probability
of occurrence of the output fault event A0 of the OR gate as follows:
POS = 1 − Π_{i=1}^{m} (1 − UAi)
= 1 − Π_{i=1}^{m} [μi/(λi + μi)]
(7.28)
where
POS is the steady state probability of occurrence of the OR gate output fault
event.
UAi is the steady state unavailability of component i; for i = 1, 2, 3, …, m.
λi is the constant failure rate of component i; for i = 1, 2, 3, …, m.
μi is the constant repair rate of component i; for i = 1, 2, 3, …, m.
Similarly, inserting Equation (7.27) into Equation (7.25) yields the following result:
PaS = Π_{i=1}^{k} UAi = Π_{i=1}^{k} [λi/(λi + μi)]
(7.29)
where
PaS is the steady state probability of occurrence of the AND gate output fault
event.
The probability evaluation of fault trees with repairable components using the above
equations is demonstrated by solving the following example [13]:
Example 7.5
Assume that in Figure 7.2 failure and repair (in parentheses) rates associated with basic
events B1, B2, B3, B4, B5, B6, and B7 are 0.0002 failures/h (4 repairs/h), 0.0002 failures/h
(4 repairs/h), 0.0002 failures/h (4 repairs/h), 0.0002 failures/h (4 repairs/h), 0.0001
failures/h (0.004 repairs/h), 0.00015 failures/h (4 repairs/h), and 0.0001 failures/h
(0.004 repairs/h), respectively. Calculate the steady state probability of occurrence
of the top event: room without light.
Substituting the given data into Equation (7.27), we get the following steady
state probability of occurrence of events B1, B2, B3, and B4:
UA = 0.0002/(0.0002 + 4) ≈ 4.99 × 10⁻⁵
Similarly, inserting the specified data into Equation (7.27), the following steady
state probability of occurrence of events B5 and B7 is obtained:
UA = 0.0001/(0.0001 + 0.004) = 0.0244
Finally, substituting the given data into Equation (7.27), we get the following
steady state probability of occurrence of event B6:
UA = 0.00015/(0.00015 + 4) = 3.75 × 10⁻⁵
Using the UA calculated value for events B1, B2, B3, and B4 in Equation (7.29),
we get the following steady state probability of occurrence of event B8:
P (B8) = (4.99 × 10⁻⁵)⁴ ≈ 6.25 × 10⁻¹⁸
FIGURE 7.11 Calculated steady state probability values of basic repairable events and of
intermediate and top events for the fault tree of Figure 7.2.
Substituting the UA calculated values for events B5 and B6 into Equation (7.28),
we obtain the following steady state probability of occurrence of event B9:
P (B9) = 1 − (1 − 0.0244) (1 − 3.75 × 10⁻⁵) ≈ 0.0244
Inserting the UA calculated values for events B7, B8, and B9 into Equation (7.28),
the following steady state probability of occurrence of event B10 is obtained:
P (B10) = 1 − (1 − 0.0244) (1 − 6.25 × 10⁻¹⁸) (1 − 0.0244) = 0.0482
Thus, the steady state probability of occurrence of the top event, room without
light, is 0.0482. The above calculated probability values are shown in the Figure 7.11
fault tree.
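A short Python sketch of the Example 7.5 calculation is given below; it simply applies Equations (7.27) through (7.29) to the given failure and repair rates and is illustrative only.

    def steady_unavailability(lam, mu):
        # Equation (7.27): UA = lambda / (lambda + mu)
        return lam / (lam + mu)

    def or_gate(uas):
        # Equation (7.28)
        p = 1.0
        for ua in uas:
            p *= (1.0 - ua)
        return 1.0 - p

    def and_gate(uas):
        # Equation (7.29)
        p = 1.0
        for ua in uas:
            p *= ua
        return p

    # Failure and repair rates from Example 7.5 (per hour)
    ua_b1 = ua_b2 = ua_b3 = ua_b4 = steady_unavailability(0.0002, 4.0)
    ua_b5 = ua_b7 = steady_unavailability(0.0001, 0.004)
    ua_b6 = steady_unavailability(0.00015, 4.0)
    b8 = and_gate([ua_b1, ua_b2, ua_b3, ua_b4])
    b9 = or_gate([ua_b5, ua_b6])
    print(round(or_gate([ua_b7, b8, b9]), 4))   # about 0.0482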
To develop a success tree from a fault tree, replace all OR gates with AND gates in the original
fault tree and vice versa, as well as replace all fault events with success events. For
example, the top fault event, room without light, becomes room lit.
Just like any other reliability analysis method, the FTA approach also has its
benefits and drawbacks. Some of the benefits of the FTA approach are as follows:
7.9 PROBLEMS
1. What are the purposes of performing FTA?
2. Write an essay on the history of the fault tree method.
3. Define the following named symbols used in constructing fault trees:
AND gate
Diamond
Circle
Rectangle
4. Discuss the basic process followed in developing fault trees.
5. What are the advantages and disadvantages of the FTA method?
6. Compare the FTA approach with the block diagram method (i.e., network
reduction technique) used in reliability analysis.
7. Assume that a windowless room has one switch and one light bulb.
Develop a fault tree for the top or undesired event dark room. Assume
the switch only fails to close.
8. Obtain a success tree for the fault tree shown in Figure 7.2.
9. Prove the following Boolean expression:
X + YZ = ( X + Y) ( X + Z )
(7.30)
10. Assume that in Figure 7.2 the basic independent fault events failure and
repair rates are 0.009 failures/h and 0.04 repairs/h, respectively. Calculate
the steady state probability of occurrence of the top event, room without light.
7.10
REFERENCES
1. Haasl, D.F., Advanced Concepts in Fault Tree Analysis, System Safety Symposium,
1965. Available from the University of Washington Library, Seattle.
2. Barlow, R.E., Fussell, J.B., and Singpurwalla, N.D., Eds., Reliability and Fault Tree
Analysis, Society for Industrial and Applied Mathematics (SIAM), Philadelphia,
1975. (Conference proceedings publication.)
3. Dhillon, B.S. and Singh, C., Bibliography of literature on fault trees, Microelectronics
and Reliability, 17, 501-503, 1978.
4. Fault Tree Handbook, Report No. NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, D.C., 1981.
5. Henley, E.J. and Kumamoto, H., Reliability Engineering and Risk Assessment, Prentice-Hall, Englewood Cliffs, NJ, 1981.
6. Dhillon, B.S. and Singh, C., Engineering Reliability: New Techniques and Applications, John Wiley & Sons, New York, 1981.
7. Schroder, R.J., Fault tree for reliability analysis, Proc. Annu. Symp. Reliability, 170-174,
1970.
8. Risk Analysis Using the Fault Tree Technique, Flow Research Report, Flow Research,
Inc., Washington, D.C., 1973.
9. Grant Ireson, W., Coombs, C.F., and Moss, R.Y., Eds., Handbook of Reliability
Engineering and Management, McGraw-Hill, New York, 1996.
10. Barlow, R.E. and Proschan, F., Statistical Theory of Reliability and Life Testing, Holt,
Rinehart and Winston, New York, 1975.
11. Fussell, J.B. and Vesely, W.E., A new methodology for obtaining cut sets for fault
trees, Trans. Am. Nucl. Soc., 15, 262-263, 1972.
12. Semanderes, S.N., ELRAFT, a computer program for efficient logic reduction analysis
of fault trees, IEEE Trans. Nuclear Sci., 18, 310-315, 1971.
13. Dhillon, B.S., Proctor, C.L., and Kothari, A., On repairable component fault trees,
Proc. Annu. Reliability Maintainability Symp., 190-193, 1979.
8.1 INTRODUCTION
Over the years, common cause failures have been receiving increasing attention
because of the realization that the assumption of independent failures may be violated
in the real life environment. For example, according to Reference 1, in the U.S.
power reactor industry, of 379 component failures or groups of failures arising
from independent causes, 78 involved common cause failures of two or more components. A common cause failure may simply be described as any instance where
multiple items fail due to a single cause. Thus, the occurrence of common cause
failures leads to lower system reliability.
The late 1960s and early 1970s may be regarded as the beginning of the serious
realization of common cause failure problems when a number of publications on
the subject appeared [2-6]. In 1975, a number of methods were proposed to evaluate
reliability of systems with common cause failures [7, 8]. In 1978, an article presented
most of the publications on common cause failures [9]. The first book that gave an
extensive coverage to common cause failures appeared in 1981 [10]. In 1994, an
article entitled Common Cause Failures in Engineering Systems: A Review presented a comprehensive review of the subject [11]. Nonetheless, ever since the late
1960s many people have contributed to the subject of common cause failures and
over 350 contributions from these individuals can be found in Reference 11.
A device is said to have three states if it operates satisfactorily in its normal
mode but fails in either of two mutually exclusive modes. Two examples of a three
state device are an electronic diode and a fluid flow valve. The two failure modes
pertaining to these two devices are open and short, and open and close, respectively. In
systems having these devices, the redundancy may increase or decrease the system
reliability. This depends on three factors: type of system configuration, number of
redundant components, and the dominant mode of component failure.
The history of reliability studies of three state devices may be traced back to
the works of Moore and Shannon [12] and Creveling [13] in 1956. In 1975, the
subject of three state device systems was studied in depth and a delta-star method
to evaluate reliability of complex systems made up of three state devices was
developed [14]. In 1977, an article entitled Literature Survey on Three State Device
Reliability Systems provided a list of publications on the subject [15]. A comprehensive review on the subject appeared in 1992 [16]. Needless to say, many other
individuals have also contributed to the topic and most of their contributions are
listed in Reference 16.
This chapter presents the topics of common cause failures and three state devices,
separately.
FIGURE 8.1 Block diagram of a parallel system with common cause failures. m is the
number of units and cc is the hypothetical unit representing common cause failures.
8.2.1 BLOCK DIAGRAM METHOD
The total failure rate of a unit of the parallel system is expressed as
λ = λin + λcc
(8.1)
where
λ is the total unit failure rate.
λin is the unit independent mode failure rate.
λcc is the unit or system common cause failure rate.
Thus, the fraction of common cause type unit failures is given by
θ = λcc/λ
(8.2)
Therefore,
λcc = θ λ
(8.3)
and
λin = (1 − θ) λ
(8.4)
In Figure 8.1 block A represents independent mode failures and block B represents
the common cause failures. The complete system can fail only if the hypothetical
unit representing all system common cause failures fails or all units of the parallel
system fail. The reliability of the modified parallel system shown in
Figure 8.1 is given by
Rmp = RA RB
= [1 − (1 − Rin)^m] Rcc
(8.5)
where
Rmp is the reliability of the modified parallel system (i.e., the parallel system
with common cause failures).
RA is the reliability of block A.
RB is the reliability of block B.
Rin is the unit's independent failure mode reliability.
Rcc is the system common cause failure mode reliability or the reliability of
the hypothetical unit representing all system common cause failures.
m is the number of identical units in parallel.
For constant failure rates λin and λcc, we get the following reliability equations,
respectively:
Rin (t) = e^−(1 − θ) λ t
(8.6)
Rcc (t) = e^−θ λ t
(8.7)
where t is time.
Substituting Equations (8.6) and (8.7) into Equation (8.5) yields
Rmp (t) = [1 − (1 − e^−(1 − θ) λ t)^m] e^−θ λ t
(8.8)
FIGURE 8.2 Reliability plots of a two-unit parallel system (m = 2) with common cause failures for θ = 0, 0.25, 0.5, 0.75, and 1.
It can be seen from Equations (8.2) and (8.8) that the parameter θ takes values from
0 to 1. Thus, at θ = 0 Equation (8.8) becomes simply the reliability expression for a
parallel system without common cause failures. More specifically, there are no
common cause failures associated with the parallel system. At θ = 1, all
the parallel system failures are of the common cause type and the system simply acts
as a single unit with failure rate λ.
Figures 8.2 and 8.3 show plots of Equation (8.8) for various specified values of
θ and for m = 2 and 3, respectively. In both cases, the system reliability decreases
with increasing values of θ.
The system mean time to failure is given by [18]:
MTTFmp = ∫₀^∞ Rmp (t) dt
= Σ_{i=1}^{m} (−1)^(i+1) C(m, i) / (λ [i − (i − 1) θ])
(8.9)
where
MTTFmp is the mean time to failure of the parallel system with common
cause failures.
C(m, i) = m!/[i! (m − i)!]
(8.10)
FIGURE 8.3 Reliability plots of a three-unit parallel system with common cause failures.
Similarly, the hazard rate of the parallel system with common cause failures is
λmp (t) = −[1/Rmp (t)] [d Rmp (t)/dt]
= θ λ + m (1 − θ) λ (1 − Γ) Γ^(m − 1)/(1 − Γ^m)
(8.11)
where
λmp (t) is the hazard rate of the parallel system with common cause failures.
Γ = 1 − e^−(1 − θ) λ t
(8.12)
k-out-of-m Network
Just like in the case of the parallel network, a hypothetical unit representing all
system common cause failures is connected in series with the independent and
identical unit k-out-of-m system. More specifically, this modified k-out-of-m network
is the same as the modified parallel network except that at least k units must function
is same as the modified parallel network except that at least k units must function
normally instead of at least one unit for the parallel case. Thus, the k-out-of-m
system with common cause failures reliability is [10]
Rk/m = [Σ_{i=k}^{m} C(m, i) Rin^i (1 − Rin)^(m − i)] Rcc
(8.13)
where
k is the required number of units for the k-out-of-m system success.
Rk/m is the reliability of the k-out-of-m system with common cause failures.
Inserting Equations (8.6) and (8.7) into Equation (8.13) we get
Rk/m (t) = [Σ_{i=k}^{m} C(m, i) e^−i (1 − θ) λ t (1 − e^−(1 − θ) λ t)^(m − i)] e^−θ λ t
(8.14)
In a manner similar to the parallel network case, the expressions for system mean
time to failure and hazard rate may be obtained. Also, the block diagram method
can be applied to other redundant configurations such as parallel-series and
bridge [18].
Example 8.1
Assume that a system is composed of two independent, active, and identical units.
At least one unit must function normally for the system success. The value of θ
is 0.5. Calculate the system reliability for a 100-h mission, if the unit failure rate is
0.004 failures/h.
Substituting the given data into Equation (8.8) yields
Rmp (100) = {1 − [1 − e^−(1 − 0.5) (0.004) (100)]²} e^−(0.5) (0.004) (100)
= 0.7918
Thus, the reliability of the two-unit parallel system with common cause failures
is 0.7918.
Example 8.2
If the value of θ is zero in Example 8.1, calculate the system reliability and compare
it with the system reliability result obtained in that example.
Using the given data in Equation (8.8) we get
Rmp (100) = 1 − [1 − e^−(0.004) (100)]² = 0.8913
Thus, in this case the system reliability without common cause failures is 0.8913.
In comparison to the system reliability result obtained in Example 8.1, i.e., 0.7918,
the occurrence of common cause failures leads to lower system reliability.
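Examples 8.1 and 8.2 can be checked with a few lines of Python applying Equation (8.8); the sketch below is illustrative and not part of the original text.

    import math

    def parallel_with_ccf(t, lam, theta, m):
        # Equation (8.8): R_mp(t) = [1 - (1 - exp(-(1 - theta) lam t))**m] * exp(-theta lam t)
        r_in = math.exp(-(1.0 - theta) * lam * t)
        r_cc = math.exp(-theta * lam * t)
        return (1.0 - (1.0 - r_in) ** m) * r_cc

    print(round(parallel_with_ccf(100.0, 0.004, 0.5, 2), 4))   # Example 8.1: 0.7918
    print(round(parallel_with_ccf(100.0, 0.004, 0.0, 2), 4))   # Example 8.2: 0.8913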
8.2.2
This is another method that can be used to perform reliability analysis of systems
with common cause failures. The application of this method is demonstrated by
applying it to the network shown in Figure 8.1. In this case, the top event is Network
Failure. The fault tree for the Figure 8.1 network is shown in Figure 8.4. The
probability of the top event occurrence, i.e., network failure, is given by
FN = FA + FB − FA FB
(8.15)
where
FN is the network failure probability, i.e., the failure probability of the parallel
system with common cause failures.
FA is the block A failure probability, i.e., failure probability of the parallel
system with independent failures.
FB is the block B failure probability, i.e., the failure probability of the hypothetical unit representing all system common cause failures.
In turn,
FA = Π_{i=1}^{m} Fi
(8.16)
where
Fi is the failure probability of independent unit i; for i = 1, 2, , m.
The network reliability (i.e., the reliability of the parallel system with common
cause failures) is given by
Rmp = 1 − FN
(8.17)
Example 8.3
Assume that a parallel system with common cause failures is composed of two units
and the system common cause failure occurrence probability is 0.2. In addition, each
unit's independent failure probability is 0.1. Calculate the failure probability of the
parallel system with common cause failures using the fault tree method.
For m = 2, the fault tree shown in Figure 8.4 will represent this example and
the specific fault tree is given in Figure 8.5. In this case, the top event is the failure
of the parallel system with common cause failures. The occurrence probability of
the top event can be calculated by using Equations (8.15) and (8.16).
Substituting the given data into Equation (8.16) yields
FA = F1 F2
= (0.1) (0.1)
= 0.01
Using the calculated and the given data in Equation (8.15), we get
FN = (0.01) + 0.2 − (0.01) (0.2)
= 0.2080
The given and the calculated probability values are shown in the Figure 8.5 fault
tree. The failure probability of the two-unit parallel system with common cause
failures is 0.2080.
Example 8.4
By using the given data of Example 8.3 and the block diagram method, show that
the same end result as in Example 8.3 is obtained.
Thus, we have
Rcc = 1 − 0.2
= 0.8
Rin = 1 − 0.1
= 0.9
m = 2
By inserting the above values into Equation (8.5) we get
Rmp = [1 − (1 − 0.9)²] (0.8)
= 0.7920
Using the above result in Equation (8.17) yields
FN = 0.2080
It proves that Example 8.3 solved through two different methods, fault tree and
block diagram, gives the same end result.
8.2.3
MARKOV METHOD
This is another method that can be used to perform reliability analysis of systems
with common cause failures. The systems can be either repairable or nonrepairable.
The following three models demonstrate the applicability of the method to redundant
systems with common cause failures:
FIGURE 8.6 State space diagram of a two-unit parallel system with common cause failures.
The numbers and letters denote system state.
Model I
This model represents a two independent unit parallel system with common cause
failures [19]. The system can fail due to a common cause failure when both the units
are active. For the system success, at least one unit must operate normally. The
system transition diagram is shown in Figure 8.6.
The following assumptions are associated with this model: the two units are identical and independent, the unit failure rate and the common cause failure rate are constant, a common cause failure can occur only when both units are operating, and the failed system is not repaired.
Using the Markov method, the system of equations associated with the Figure 8.6 state space diagram is
d P0 (t)/dt + (2λ + λcc) P0 (t) = 0
(8.18)
d P1 (t)/dt + λ P1 (t) = 2λ P0 (t)
(8.19)
d P2 (t)/dt = λ P1 (t)
(8.20)
d Pcc (t)/dt = λcc P0 (t)
(8.21)
Solving Equations (8.18) through (8.21) with the initial conditions P0 (0) = 1 and P1 (0) = P2 (0) = Pcc (0) = 0, we get
P0 (t) = e^−(2λ + λcc) t
(8.22)
P1 (t) = X [e^−λ t − e^−(2λ + λcc) t]
(8.23)
where
X = 2λ/(λ + λcc)
(8.24)
P2 (t) = Y + X [Z e^−(2λ + λcc) t − e^−λ t]
(8.25)
where
Y = 2λ/(2λ + λcc)
(8.26)
Z = λ/(2λ + λcc)
(8.27)
Pcc (t) = [λcc/(2λ + λcc)] [1 − e^−(2λ + λcc) t]
(8.28)
The system reliability is given by
Rps (t) = P0 (t) + P1 (t)
= e^−(2λ + λcc) t + X [e^−λ t − e^−(2λ + λcc) t]
(8.29)
The mean time to failure of the parallel system with common cause failures is given
by
MTTFps = ∫₀^∞ Rps (t) dt
= 3/(2λ + λcc)
(8.30)
where
MTTFps is the mean time to failure of the two unit parallel system with
common cause failures.
The probability of the system failure due to a common cause failure is given by
Pcc (t) = [λcc/(2λ + λcc)] [1 − e^−(2λ + λcc) t]
(8.31)
Example 8.5
Assume that a system has two identical units in parallel and it can fail due to the
occurrence of a common cause failure. The common cause failure occurrence rate
is 0.0005 failures/h. If the unit failure rate is 0.004 failures/h, calculate the following:
System mean time to failure.
System mean time to failure, if the occurrence rate of common cause failures is equal to zero.
Compare the above two results.
By substituting the given data into Equation (8.30), we get
$$ MTTF_{ps} = \frac{3}{2(0.004) + 0.0005} \simeq 353 \text{ h} $$
For zero common cause failures and the other given data, Equation (8.30) yields
$$ MTTF_{ps} = \frac{3}{2(0.004)} = 375 \text{ h} $$
It means that the occurrence of common cause failures leads to the reduction in
the system mean time to failure (i.e., from 375 h to 353 h).
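The Model I results can also be reproduced numerically. The short Python sketch below (function names are illustrative, not from the original text) evaluates Equations (8.29) and (8.30) for the Example 8.5 data.

```python
import math

def model1_reliability(t, lam, lam_cc):
    """Equation (8.29): two-unit parallel system with common cause failures."""
    x = 2 * lam / (lam + lam_cc)
    a = 2 * lam + lam_cc
    return math.exp(-a * t) + x * (math.exp(-lam * t) - math.exp(-a * t))

def model1_mttf(lam, lam_cc):
    """Equation (8.30): MTTF = 3 / (2*lambda + lambda_cc)."""
    return 3.0 / (2 * lam + lam_cc)

print(round(model1_mttf(0.004, 0.0005)))  # ~353 h
print(round(model1_mttf(0.004, 0.0)))     # 375 h
print(model1_reliability(100, 0.004, 0.0005))
```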
FIGURE 8.7 The state transition diagram of the system without repair. CCF means common
cause failure.
Model II
This model represents a system composed of two active parallel units and one
standby subject to two types of common cause failures [20]. The type I common
cause failure leads to total system failure (active plus the standby units) whereas,
type II failures only cause the failure of the active two units. The system units can
also malfunction due to a normal failure.
The system transition diagram is shown in Figure 8.7. The system starts operating (two parallel units and one unit on standby) at time t = 0. As soon as one of
the operating units fails due to a normal failure, it is immediately replaced with the
standby unit. After this, the system may fail due to a common cause failure or one
of the operating units fails because of the normal failure. The single operating unit
may fail either due to a normal failure or a common cause failure.
In the state when two units operate and one is on standby, the system units may
also fail due to type I and type II common cause failures. In the case of the occurrence
of type II common cause failure, the standby unit is immediately activated. For the
system success, at least one of these units must operate normally.
The following symbols are associated with the analysis of the Figure 8.7 system
transition diagram:
i denotes the state of the system shown in Figure 8.7: i = 0 (two units operating
in parallel and one on standby), i = 1 (one of the operating units fails
and the standby is activated), i = 2 (one unit operating, two units
failed), i = 3 (two active parallel units failed due to a type II common
cause failure and the standby unit is activated), i = 4 (system failed,
all three units failed due to a type I common cause failure), i = 5
(system failed due to normal failures), i = 6 (system failed due to a
common cause failure when operating with two or less good units).
P_i(t) is the probability that the system is in state i at time t; for i = 0, 1, 2, …, 6.
λ is the constant unit (normal) failure rate.
λ_c is the constant type I common cause failure rate.
γ is the constant type II common cause failure rate.
θ is the constant common cause failure rate of the system with two good units.
α is the constant common cause failure rate of the system with one good unit.
R(t) is the system reliability at time t.
MTTF is the system mean time to failure.
The system of equations associated with Figure 8.7 is
$$ \frac{dP_0(t)}{dt} + (2\lambda + \lambda_c + \gamma)\,P_0(t) = 0 \qquad (8.32) $$
$$ \frac{dP_1(t)}{dt} + (2\lambda + \theta)\,P_1(t) = 2\lambda\,P_0(t) \qquad (8.33) $$
$$ \frac{dP_2(t)}{dt} + (\lambda + \alpha)\,P_2(t) = 2\lambda\,P_1(t) \qquad (8.34) $$
$$ \frac{dP_3(t)}{dt} + (\lambda + \alpha)\,P_3(t) = \gamma\,P_0(t) \qquad (8.35) $$
$$ \frac{dP_4(t)}{dt} = \lambda_c\,P_0(t) \qquad (8.36) $$
$$ \frac{dP_5(t)}{dt} = \lambda\,P_2(t) + \lambda\,P_3(t) \qquad (8.37) $$
$$ \frac{dP_6(t)}{dt} = \theta\,P_1(t) + \alpha\,P_2(t) + \alpha\,P_3(t) \qquad (8.38) $$
Solving Equations (8.32) through (8.38), we get the following state probability
expressions:
$$ P_0(t) = e^{-At} \qquad (8.39) $$
$$ P_1(t) = \frac{2\lambda}{A - B}\left(e^{-Bt} - e^{-At}\right) \qquad (8.40) $$
$$ P_2(t) = 4\lambda^2\left[\frac{e^{-Ct}}{(B-C)(A-C)} + \frac{e^{-Bt}}{(C-B)(A-B)} + \frac{e^{-At}}{(C-A)(B-A)}\right] \qquad (8.41) $$
$$ P_3(t) = \frac{\gamma}{D}\left(e^{-Ct} - e^{-At}\right) \qquad (8.42) $$
where
A = 2λ + λ_c + γ
B = 2λ + θ
C = λ + α
D = A − C = λ + λ_c + γ − α
$$ P_4(t) = \frac{\lambda_c}{A}\left(1 - e^{-At}\right) \qquad (8.43) $$
$$ P_5(t) = \frac{\lambda\gamma}{AC}\left[1 + \frac{C\,e^{-At} - A\,e^{-Ct}}{A - C}\right] + 4\lambda^3\left[\frac{1}{ABC} - \frac{e^{-At}}{AD(A-B)} - \frac{e^{-Ct}}{CDE} + \frac{e^{-Bt}}{BE(A-B)}\right] \qquad (8.44) $$
where
E = B − C = λ + θ − α
$$ P_6(t) = \frac{2\lambda\theta}{AB}\left[1 + \frac{B\,e^{-At} - A\,e^{-Bt}}{A - B}\right] + \frac{\alpha\gamma}{AC}\left[1 + \frac{C\,e^{-At} - A\,e^{-Ct}}{A - C}\right] + 4\lambda^2\alpha\left[\frac{1}{ABC} - \frac{e^{-At}}{AD(A-B)} - \frac{e^{-Ct}}{CDE} + \frac{e^{-Bt}}{BE(A-B)}\right] \qquad (8.45) $$
The system reliability is given by
$$ R(t) = \sum_{i=0}^{3} P_i(t) = e^{-At} + \frac{2\lambda}{A - B}\left(e^{-Bt} - e^{-At}\right) + 4\lambda^2\left[\frac{e^{-Ct}}{(B-C)(A-C)} + \frac{e^{-Bt}}{(C-B)(A-B)} + \frac{e^{-At}}{(C-A)(B-A)}\right] + \frac{\gamma}{D}\left(e^{-Ct} - e^{-At}\right) \qquad (8.46) $$
The system mean time to failure is
$$ MTTF = \int_0^{\infty} R(t)\,dt = \frac{1}{A}\left[1 + \frac{2\lambda}{B} + \frac{4\lambda^2}{BC} + \frac{\gamma}{C}\right] \qquad (8.47) $$
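Equation (8.47) is simple to evaluate numerically. The brief Python sketch below uses the symbols defined in the notation list above; all rate values are illustrative assumptions, not data from the text.

```python
def model2_mttf(lam, lam_c, gamma, theta, alpha):
    """Equation (8.47): MTTF of the two-active-unit plus one-standby system
    with type I and type II common cause failures (no repair).
    lam: unit failure rate; lam_c: type I CCF rate; gamma: type II CCF rate;
    theta, alpha: CCF rates with two and one good unit(s), respectively."""
    a = 2 * lam + lam_c + gamma
    b = 2 * lam + theta
    c = lam + alpha
    return (1.0 / a) * (1.0 + 2 * lam / b + 4 * lam ** 2 / (b * c) + gamma / c)

# Illustrative rates in failures per hour
print(model2_mttf(lam=0.004, lam_c=0.0005, gamma=0.0004, theta=0.0003, alpha=0.0002))
```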
Model III
This model is the same as Model II except that a repair process is initiated as soon
as one of the two active units fails; the failed unit/units in states 1, 2, and 3 of
Figure 8.7 are repaired at a constant repair rate μ.
The additional symbols used in this model are as follows:
s is the Laplace transform variable.
R(s) is the Laplace transform of the system reliability, R(t).
P_i(s) is the Laplace transform of P_i(t), for i = 0, 1, 2, …, 6.
Thus, for this model, the Laplace transforms of the state probabilities take the
following general form:
$$ P_0(s) = \frac{s^2 + sG + F}{s^3 + s^2 J + sH + I} \qquad (8.48) $$
where F, G, H, I, and J are constants expressed in terms of the failure rates λ, λ_c,
γ, θ, and α and the repair rate μ.  (8.49)
The remaining state probabilities are obtained in terms of P_0(s); for example,
$$ P_2(s) = \frac{4\lambda^2\,P_0(s)}{s^2 + sG + F} \qquad (8.51) $$
$$ P_4(s) = \frac{\lambda_c\,P_0(s)}{s} \qquad (8.53) $$
The Laplace transform of the system reliability with repair is
$$ R_r(s) = \sum_{i=0}^{3} P_i(s) \qquad (8.56) $$
and the corresponding mean time to failure follows from the final value of this
transform, i.e., MTTF_r = lim_{s→0} R_r(s).  (8.57)
8.3.1 RELIABILITY EVALUATION OF A SERIES NETWORK
The block diagram of a series network composed of m three state devices is shown
in Figure 8.8. Each block represents a three state device i that can operate normally
(Ni ) or fail either open (Oi ) or short (Si ) circuited. The system operates normally
when either all the devices function successfully or until at least one device is
working normally and the remaining ones have failed in their short circuit mode.
This is based on the assumption that the current can still flow through a short circuited
device. On the other hand, the system fails when either any one of the devices fails
in open mode or all the system devices are short circuited. A reliability equation for
this network can be developed by using the probability tree method [21]. A probability tree for a two component three state device network is given in Figure 8.9.
The dot on the extreme left side of Figure 8.9 signifies the point of origin of
paths and the dots on the extreme right side indicate the paths termination points.
The total number of different paths in a three state device probability tree is given
by 3^m, where m denotes the number of units (i.e., three-state devices) in the network.
The following symbols are used in the probability tree analysis:
Ni denotes the normal mode (i.e., success) of the ith unit (i.e., ith path links
success)
Oi denotes the open mode failure state of the ith unit (i.e., ith path links
failure)
Si denotes the short mode failure state of the ith unit (i.e., ith path links
failure)
Pi probability of being in State Ni
FIGURE 8.9 Probability tree diagram for two non-identical unit three state device network.
From Figure 8.9, the success paths of the two non-identical unit series system are
N1 N2, N1 S2, and S1 N2. With the aid of these paths and the relationships P1 + qO1 + qS1 = 1
and P2 + qO2 + qS2 = 1, the independent series network reliability is
$$ R_{S2} = P_1 P_2 + P_1 q_{S2} + P_2 q_{S1} = (1 - q_{O1})(1 - q_{O2}) - q_{S1} q_{S2} \qquad (8.58) $$
Similarly, the three non-identical unit series system success paths are as follows:
N1 N2 N3, N1 N2 S3, N1 S2 N3, N1 S2 S3, S1 N2 N3, S1 N2 S3, S1 S2 N3
Thus, the reliability of the three independent non-identical unit series system is
given by
$$ R_{S3} = P_1P_2P_3 + P_1P_2q_{S3} + P_1q_{S2}P_3 + P_1q_{S2}q_{S3} + q_{S1}P_2P_3 + q_{S1}P_2q_{S3} + q_{S1}q_{S2}P_3 = (1 - q_{O1})(1 - q_{O2})(1 - q_{O3}) - q_{S1}q_{S2}q_{S3} \qquad (8.59) $$
Thus, for an m unit or device series system, we can generalize Equation (8.59) to get
$$ R_{Sm} = \prod_{i=1}^{m}\left(1 - q_{Oi}\right) - \prod_{i=1}^{m} q_{Si} \qquad (8.60) $$
FIGURE 8.10 State space diagram of a three state device. N, O, and S in rectangle, diamond,
and circle mean device operating normally, device failed in open mode, and device failed in
short mode, respectively.
For independent and identical devices, Equation (8.60) becomes
$$ R_{Sm} = (1 - q_O)^m - q_S^m \qquad (8.61) $$
where
qo is the open mode failure probability of the device.
qS is the short mode failure probability of the device.
For constant open and short mode failure rates, we use the Markov method to
obtain the open and short failure probabilities of a three state device. Figure 8.10
presents the state space diagram of a three state device.
The following symbols are associated with the Figure 8.10 model:
i
is the ith state of the three state device: i = N (means the three state
device is operating normally), i = O (means the three state device failed
in open mode), i = S (means the three state device failed in short mode).
Pi (t) is the probability that the three state device is in state i at time t; for
i = N, O, S.
O is the constant open mode failure rate of the three state device.
S is the constant short mode failure rate of the three state device.
The following system of equations is associated with Figure 8.10.
$$ \frac{dP_N(t)}{dt} + (\lambda_O + \lambda_S)\,P_N(t) = 0 \qquad (8.62) $$
$$ \frac{dP_O(t)}{dt} - \lambda_O\,P_N(t) = 0 \qquad (8.63) $$
$$ \frac{dP_S(t)}{dt} - \lambda_S\,P_N(t) = 0 \qquad (8.64) $$
At time t = 0, P_N(0) = 1 and P_O(0) = P_S(0) = 0. Solving Equations (8.62) through
(8.64), we get
$$ P_N(t) = e^{-(\lambda_O + \lambda_S)t} \qquad (8.65) $$
$$ P_O(t) = q_O(t) = \frac{\lambda_O}{\lambda_O + \lambda_S}\left[1 - e^{-(\lambda_O + \lambda_S)t}\right] \qquad (8.66) $$
$$ P_S(t) = q_S(t) = \frac{\lambda_S}{\lambda_O + \lambda_S}\left[1 - e^{-(\lambda_O + \lambda_S)t}\right] \qquad (8.67) $$
Substituting Equations (8.66) and (8.67) into Equation (8.60) yields
$$ R_{Sm}(t) = \prod_{i=1}^{m}\left\{1 - \frac{\lambda_{Oi}}{\lambda_{Oi} + \lambda_{Si}}\left[1 - e^{-(\lambda_{Oi} + \lambda_{Si})t}\right]\right\} - \prod_{i=1}^{m}\frac{\lambda_{Si}}{\lambda_{Oi} + \lambda_{Si}}\left[1 - e^{-(\lambda_{Oi} + \lambda_{Si})t}\right] \qquad (8.68) $$
where
RSm (t) is the three state device series system reliability at time t.
For identical devices, Equation (8.68) becomes
$$ R_{Sm}(t) = \left\{1 - \frac{\lambda_O}{\lambda_O + \lambda_S}\left[1 - e^{-(\lambda_O + \lambda_S)t}\right]\right\}^m - \left\{\frac{\lambda_S}{\lambda_O + \lambda_S}\left[1 - e^{-(\lambda_O + \lambda_S)t}\right]\right\}^m \qquad (8.69) $$
Example 8.6
A system is composed of three independent and identical three state devices in series.
Each device's open mode and short mode failure probabilities are 0.05 and 0.03,
respectively. Calculate the system reliability.
Substituting the given data into Equation (8.61) yields
$$ R_{S3} = (1 - 0.05)^3 - (0.03)^3 = 0.8573 $$
Thus, the series system reliability is 0.8573.
Example 8.7
A three state device network has two independent and identical units in series. Each
unit's open mode and short mode failure rates are 0.002 and 0.004 failures/h,
respectively. Calculate the network reliability for a 100-h mission.
Substituting the given data into Equation (8.69) yields
$$ R_{S2}(100) = \left\{1 - \frac{0.002}{0.002 + 0.004}\left[1 - e^{-(0.002 + 0.004)(100)}\right]\right\}^2 - \left\{\frac{0.004}{0.002 + 0.004}\left[1 - e^{-(0.002 + 0.004)(100)}\right]\right\}^2 = 0.6314 $$
Thus, the network reliability for a 100-h mission is 0.6314.
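Equation (8.69) is easy to evaluate numerically; a minimal Python sketch (function and variable names are illustrative) that reproduces Example 8.7 is shown below.

```python
import math

def three_state_series_reliability(t, m, lam_o, lam_s):
    """Equation (8.69): m identical three-state devices in series, with constant
    open (lam_o) and short (lam_s) mode failure rates."""
    total = lam_o + lam_s
    q_o = (lam_o / total) * (1.0 - math.exp(-total * t))  # Equation (8.66)
    q_s = (lam_s / total) * (1.0 - math.exp(-total * t))  # Equation (8.67)
    return (1.0 - q_o) ** m - q_s ** m

print(round(three_state_series_reliability(100, 2, 0.002, 0.004), 4))  # 0.6314
```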
8.3.2 RELIABILITY EVALUATION OF A PARALLEL NETWORK
Figure 8.11 shows the block diagram of a parallel network having m three state
devices. Each block denotes a three state device i that can operate normally (Ni ) or
fail either open (Oi ) or short (Si ) circuited. The system functions normally when
either all the devices work successfully or until at least one device operates normally
and the remaining ones have failed in their open mode. In contrast, the system fails
when all the devices fail in their open mode or any one of them fails in short mode.
As for the series system, the Figure 8.9 probability tree can be used to develop
a reliability expression for the two unit parallel system. Thus, from Figure 8.9 the
two non-identical device parallel system success paths are as follows:
N1 N2
N1 O2
O1 N2
With the aid of the above paths and the relationships P1 +qO1 + qS1 = 1 and P2 +qO2 +
qS2 = 1, the independent parallel network reliability is
$$ R_{p2} = P_1 P_2 + P_1 q_{O2} + P_2 q_{O1} = (1 - q_{S1})(1 - q_{S2}) - q_{O1} q_{O2} \qquad (8.70) $$
Similarly, the three non-identical unit parallel system success paths are as follows:
N1 N2 N3, N1 N2 O3, N1 O2 N3, N1 O2 O3, O1 N2 N3, O1 N2 O3, O1 O2 N3
With the aid of the above paths, the reliability of the three independent non-identical
unit parallel system is given by
$$ R_{p3} = P_1P_2P_3 + P_1P_2q_{O3} + P_1q_{O2}P_3 + P_1q_{O2}q_{O3} + q_{O1}P_2P_3 + q_{O1}P_2q_{O3} + q_{O1}q_{O2}P_3 = (1 - q_{S1})(1 - q_{S2})(1 - q_{S3}) - q_{O1}q_{O2}q_{O3} \qquad (8.71) $$
Thus, for an m unit or device parallel system, we can generalize Equation (8.71) as
follows:
$$ R_{pm} = \prod_{i=1}^{m}\left(1 - q_{Si}\right) - \prod_{i=1}^{m} q_{Oi} \qquad (8.72) $$
For independent and identical devices, Equation (8.72) becomes
$$ R_{pm} = (1 - q_S)^m - q_O^m \qquad (8.73) $$
For a device's constant open and short mode failure rates, λ_O and λ_S, respectively,
we substitute Equations (8.66) and (8.67) into Equation (8.72) to get
$$ R_{pm}(t) = \prod_{i=1}^{m}\left\{1 - \frac{\lambda_{Si}}{\lambda_{Oi} + \lambda_{Si}}\left[1 - e^{-(\lambda_{Oi} + \lambda_{Si})t}\right]\right\} - \prod_{i=1}^{m}\frac{\lambda_{Oi}}{\lambda_{Oi} + \lambda_{Si}}\left[1 - e^{-(\lambda_{Oi} + \lambda_{Si})t}\right] \qquad (8.74) $$
where
Rpm (t) is the three state device parallel system reliability at time t.
For identical devices, Equation (8.74) simplifies to
$$ R_{pm}(t) = \left\{1 - \frac{\lambda_S}{\lambda_O + \lambda_S}\left[1 - e^{-(\lambda_O + \lambda_S)t}\right]\right\}^m - \left\{\frac{\lambda_O}{\lambda_O + \lambda_S}\left[1 - e^{-(\lambda_O + \lambda_S)t}\right]\right\}^m \qquad (8.75) $$
Example 8.8
Assume that in Example 8.6 the system is parallel instead of series and the other
data are exactly the same. Calculate the parallel system reliability.
Inserting the given values into Equation (8.73) we get
$$ R_{p3} = (1 - 0.03)^3 - (0.05)^3 = 0.9125 $$
Thus, the parallel system reliability is 0.9125.
Example 8.9
Assume that in Example 8.7 the three state device network is parallel instead of
series and the other given data are exactly the same. Determine the parallel network
reliability for a 100-h mission.
Thus, substituting the specified values into Equation (8.75) yields
$$ R_{p2}(100) = \left\{1 - \frac{0.004}{0.002 + 0.004}\left[1 - e^{-(0.002 + 0.004)(100)}\right]\right\}^2 - \left\{\frac{0.002}{0.002 + 0.004}\left[1 - e^{-(0.002 + 0.004)(100)}\right]\right\}^2 = 0.4663 $$
Thus, the parallel network reliability for a 100-h mission is 0.4663.
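The parallel counterpart, Equation (8.75), can be checked the same way; the short Python sketch below (illustrative names) reproduces Example 8.9.

```python
import math

def three_state_parallel_reliability(t, m, lam_o, lam_s):
    """Equation (8.75): m identical three-state devices in parallel."""
    total = lam_o + lam_s
    q_o = (lam_o / total) * (1.0 - math.exp(-total * t))
    q_s = (lam_s / total) * (1.0 - math.exp(-total * t))
    return (1.0 - q_s) ** m - q_o ** m

print(round(three_state_parallel_reliability(100, 2, 0.002, 0.004), 4))  # 0.4663
```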
8.3.3 RELIABILITY OPTIMIZATION OF A SERIES NETWORK
In this case, for independent and identical devices, the series network reliability from
Equation (8.61) is [10]
$$ R_{Sm} = (1 - q_O)^m - q_S^m \qquad (8.76) $$
Differentiating Equation (8.76) with respect to m and setting the result equal to zero, we get
$$ \frac{\partial R_{Sm}}{\partial m} = (1 - q_O)^m \ln(1 - q_O) - q_S^m \ln q_S = 0 \qquad (8.77) $$
Solving Equation (8.77) for m yields
$$ m^* = \frac{\ln\left[\ln q_S / \ln(1 - q_O)\right]}{\ln\left[(1 - q_O)/q_S\right]} \qquad (8.78) $$
where
m* is the optimum number of devices in series.
Example 8.10
Assume that open and short mode failure occurrence probabilities of independent
and identical three state devices are 0.02 and 0.05, respectively. Determine the
number of devices to be arranged in series to maximize system reliability.
Thus, substituting the given data into Equation (8.78), we get
$$ m^* = \frac{\ln\left[\ln(0.05)/\ln(1 - 0.02)\right]}{\ln\left[(1 - 0.02)/(0.05)\right]} \simeq 2 \text{ devices} $$
8.3.4 RELIABILITY OPTIMIZATION OF A PARALLEL NETWORK
From Equation (8.73), the reliability of the parallel network with independent and
identical devices is
$$ R_{pm} = (1 - q_S)^m - q_O^m \qquad (8.79) $$
Differentiating Equation (8.79) with respect to m and equating the result to zero, we get
$$ \frac{\partial R_{pm}}{\partial m} = (1 - q_S)^m \ln(1 - q_S) - q_O^m \ln q_O = 0 \qquad (8.80) $$
$$ m^* = \frac{\ln\left[\ln q_O / \ln(1 - q_S)\right]}{\ln\left[(1 - q_S)/q_O\right]} \qquad (8.81) $$
where
m* is the optimum number of devices in parallel.
Example 8.11
After examining the failure data of one type of identical three state devices, it is
estimated that their open and short mode failure probabilities are 0.08 and 0.04,
respectively. Calculate the number of independent devices to be placed in a parallel
configuration to maximize system reliability.
Inserting the given data into Equation (8.81) yields
$$ m^* = \frac{\ln\left[\ln(0.08)/\ln(1 - 0.04)\right]}{\ln\left[(1 - 0.04)/(0.08)\right]} \simeq 2 \text{ devices} $$
Thus, for maximum system reliability two three state devices should be arranged
in parallel.
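Equations (8.78) and (8.81) can be evaluated directly; the Python sketch below (illustrative names) reproduces Examples 8.10 and 8.11.

```python
import math

def optimum_series_devices(q_o, q_s):
    """Equation (8.78): optimum number of identical three-state devices in series."""
    return math.log(math.log(q_s) / math.log(1.0 - q_o)) / math.log((1.0 - q_o) / q_s)

def optimum_parallel_devices(q_o, q_s):
    """Equation (8.81): optimum number of identical three-state devices in parallel."""
    return math.log(math.log(q_o) / math.log(1.0 - q_s)) / math.log((1.0 - q_s) / q_o)

print(round(optimum_series_devices(0.02, 0.05)))    # ~2 (Example 8.10)
print(round(optimum_parallel_devices(0.08, 0.04)))  # ~2 (Example 8.11)
```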
8.4 PROBLEMS
1. Define the following three terms:
Three state device
Common-cause failure
Short mode failure
2. Discuss at least five causes for the occurrence of common cause failures.
3. Describe the methods that can be used to perform reliability analysis of
redundant networks with common cause failures.
4. Using Equation (8.14), obtain an expression for mean time to failure of
a k-out-of-m network with common cause failures.
5. Assume that in Figure 8.6 the failed unit is repairable from the system
state 1 at a constant repair rate μ. Develop expressions for system reliability and mean time to failure.
6. A system is composed of two identical units in parallel and it can malfunction due to the occurrence of a common cause failure. The common
cause failure rate is 0.0001 failures/h. Calculate the system reliability and
common cause failure probability for a 200-h mission, if the unit failure
rate is 0.001 failures/h.
7. An electronic device can fail either in its open mode or short mode or it
simply operates normally. The open and short mode failure rates of the
device are 0.007 and 0.005 failures/h, respectively. Calculate the probability of the device failing in its short mode during a 1000-h operational
period.
8. A series system is composed of two independent and identical three state
devices. The open and short mode failure rates of the device are 0.0003
and 0.007 failures/h, respectively. Calculate the system reliability for a
50-h mission period.
9. A parallel system is composed of three independent and non-identical
three state devices. The open mode failure probabilities of devices A, B,
and C are 0.02, 0.04, and 0.07, respectively. Similarly, the short mode
failure probabilities of devices A, B, C are 0.03, 0.05, and 0.01, respectively. Calculate the open and short mode failure probabilities of the
parallel system.
10. Assume that open and short mode failure occurrence probabilities of
independent and identical three state devices are 0.12 and 0.10, respectively. Determine the number of devices to be connected in parallel to
maximize system reliability.
8.5 REFERENCES
1. Taylor, J.R., A study of failure causes based on U.S. power reactor abnormal occurrence reports, Reliab. Nucl. Power Plants, IAEA-SM-195/16, 1975.
2. Epler, E.P., Common-mode considerations in the design of systems for protection
and control, Nuclear Safety, 11, 323-327, 1969.
3. Ditto, S.J., Failures of systems designed for high reliability, Nuclear Safety, 8, 35-37,
1966.
4. Epler, E.P., The ORR emergency cooling failures, Nuclear Safety, 11, 323-327, 1970.
5. Jacobs, I.M., The common-mode failure study discipline, IEEE Trans. Nuclear Sci.,
17, 594-598, 1970.
6. Gangloff, W.C., Common-mode failure analysis is in, Electronic World, October,
30-33, 1972.
7. Gangloff, W.C., Common mode failure analysis, IEEE Trans. Power Apparatus Syst.,
94, 27-30, 1975.
8. Fleming, K.N., A redundant model for common mode failures in redundant safety
systems, Proc. Sixth Pittsburgh Annu. Modeling Simulation Conf., 579-581, 1975.
9. Dhillon, B.S., On common-cause failures Bibliography, Microelectronics and
Reliability, 18, 533-534, 1978.
10. Dhillon, B.S. and Singh, C., Engineering Reliability: New Techniques and Applications, John Wiley & Sons, New York, 1981.
11. Dhillon, B.S. and Anude, D.C., Common-cause failures in engineering systems: a
review, Int. J. Reliability, Quality, Safety Eng., 1, 103-129, 1994.
12. Moore, E.F. and Shannon, C.E., Reliable circuits using less reliable relays, J. Franklin
Inst., 9, 191-208, 1956, and 281-297, 1956.
13. Creveling, C.J., Increasing the reliability of electronic equipment by the use of
redundant circuits, Proc. Inst. Radio Eng. (IRE), 44, 509-515, 1956.
14. Dhillon, B.S., The Analysis of the Reliability of Multi-State Device Networks, Ph.D.
Dissertation, 1975, available from the National Library of Canada, Ottawa.
15. Dhillon, B.S., Literature survey on three-state device reliability systems, 16, 601-602,
1977.
16. Lesanovsky, A., Systems with two dual failure modes A survey, Microelectronics
and Reliability, 33, 1597-1626, 1993.
17. WASH 1400 (NUREG-75/014), Reactor Safety Study, U.S. Nuclear Regulatory Commission, Washington, D.C., October 1975. Available from the National Technical
Information Service, Springfield, VA.
18. Dhillon, B.S. and Proctor, C.L., Common mode failure analysis of reliability networks, Proc. Annu. Reliability Maintainability Symp., 404-408, 1977.
19. Dhillon, B.S., Reliability in Computer System Design, Ablex Publishing, Norwood,
NJ, 1987.
20. Dhillon, B.S. and Yang, N., Analysis of an engineering system with two types of
common cause failures, Proc. First Intl. Conf. Quality Reliability, Hong Kong, 2,
393-397, 1995.
21. Dhillon, B.S. and Rayapati, S.N., Reliability evaluation of multi-state device networks
with probability trees, Proc. 6th Symp. Reliability Electron., Budapest, 27-37, 1985.
9
Mechanical Reliability
9.1 INTRODUCTION
Usually, the reliability of electronic parts is predicted on the assumption that their
failure times are exponentially distributed (i.e., their failure rates are constant). The
past experience also generally supports this assumption. Nonetheless, this assumption was derived from the bathtub hazard rate concept which states that during the
useful life of many engineering items, the failure rate remains constant. However,
in the case of mechanical items, the assumption of constant failure rate is not
generally true. In fact, in many instances their increasing failure rate patterns can
be represented by an exponential function.
The history of mechanical reliability may be traced back to World War II with
the development of V1 and V2 rockets by the Germans. In 1951, Weibull published
a statistical function to represent material strength and life length [1]. Today it is
known as the Weibull distribution and it has played an important role in the development of mechanical reliability and reliability in general. Also, in the 1950s
Freudenthal reported many advances to structural reliability [2-4].
In 1963 and 1964, the National Aeronautics and Space Administration (NASA)
lost SYNCOM I and Mariner III due to bursting of a high pressure tank and mechanical failure, respectively. As a result of these and other mechanical failures, researchers in the field called for the improvement in reliability and longevity of mechanical
and electromechanical parts. In years to come, NASA spent millions of dollars to
test, replace, and redesign various items, including mechanical valves, filters, actuators, pressure gauges, and pressure switches [5].
Some of the major projects initiated by NASA in 1965 were entitled Designing
Specified Reliability Levels into Mechanical Components with Time-Dependent
Stress and Strength Distributions, Reliability Demonstration Using Overstress
Testing, and Reliability of Structures and Components Subjected to Random
Dynamic Loading [6]. After the mid-1960s, many other people [7-9] contributed
to mechanical reliability and a comprehensive list of publications on the subject is
given in Reference 10.
This chapter presents various different aspects of mechanical reliability.
FIGURE 9.1 Basic reasons for the development of the mechanical reliability field.
One reported breakdown of the observed failure modes was as follows:
Breakage: 61.2%
Surface fatigue: 20.3%
Wear: 13.2%
Plastic flow: 5.3%
Furthermore, the failure causes were grouped into five distinct categories: (1) service
related (74.7%), (2) heat-treatment related (16.2%), (3) design related (6.9%),
(4) manufacturing related (1.4%), and (5) material related (0.8%). In turn, each of
these five categories was further broken into elements as follows:
Service-Related (74.7%)
Design-Related (6.9%)
Wrong design: 2.8%
Specification of suitable heat treatment: 2.5%
Incorrect material selection: 1.6%
Manufacturing-Related (1.4%)
Grinding burns: 0.7%
Tool marks or notches: 0.7%
Material-Related (0.8%)
Steel defects: 0.5%
Mixed steel or incorrect composition: 0.2%
Forging defects: 0.1%
9.4.1 SAFETY FACTOR
There are many different ways of defining a safety factor [9, 10, 16-19]. Two different definitions of the safety factor are presented below.
Definition I
The safety factor is defined by [20]:
$$ SF = \frac{MFGST}{MFGS} \geq 1 \qquad (9.1) $$
where
SF is the safety factor.
MFGST is the mean failure governing strength.
MFGS is the mean failure governing stress.
This safety factor is a good measure when both the stress and strength are
described by a normal distribution. However, it is important to note that when the
variation of stress and/or strength is large, this measure of safety becomes meaningless because of positive failure rate.
Definition II
The safety factor is defined by [12, 13]:
$$ SF = \frac{SM}{MAWS} \qquad (9.2) $$
where
SM is the strength of the material.
MAWS is the maximum allowable working stress.
The value of this safety factor is always greater than unity; a value less
than unity indicates that the item under consideration will fail because the applied
stress is greater than the strength.
The standard deviation of the safety factor is expressed by [12, 13]:
$$ \sigma = \left[\left(\frac{\sigma_{th}}{MAWS}\right)^2 + \left(\frac{SM}{MAWS^2}\right)^2 \sigma_s^2\right]^{1/2} \qquad (9.3) $$
where
σ is the standard deviation of the safety factor.
σ_th is the standard deviation of the strength.
σ_s is the standard deviation of the stress.
All in all, there are many factors that must be carefully considered in selecting an
appropriate value of the safety factor: uncertainty of load, cost, failure consequences, and so on.
9.4.2 SAFETY MARGIN
The safety margin, m, is defined as
$$ m = SF - 1 \qquad (9.4) $$
where
m is the safety margin.
SF is the safety factor.
The value of this measure is always greater than zero and its negative value indicates
that the item under consideration will fail. For normally distributed stress and
strength, the safety margin is expressed by [9]:
$$ m = \frac{\mu_{th} - ms}{\sigma_{th}} \qquad (9.5) $$
where
μ_th is the average strength.
ms is the maximum stress.
σ_th is the standard deviation of the strength.
The maximum stress, ms, is given by
$$ ms = \mu_s + k\,\sigma_s \qquad (9.6) $$
where
μ_s is the mean value of the stress.
σ_s is the standard deviation of the stress.
k is a factor which takes values between 3 and 6.
It is to be noted that just like the safety factor, the safety margin is a random variable.
Example 9.1
Assume that the following data are known for a mechanical item under design:
μ_th = 30,000 psi, σ_th = 1200 psi, μ_s = 15,000 psi, σ_s = 300 psi
Calculate the safety margin for k = 3 and k = 6.
For k = 3, Equations (9.6) and (9.5) yield
$$ ms = 15,000 + 3(300) = 15,900 \text{ psi}, \qquad m = \frac{30,000 - 15,900}{1200} = 11.75 $$
For k = 6,
$$ ms = 15,000 + 6(300) = 16,800 \text{ psi}, \qquad m = \frac{30,000 - 16,800}{1200} = 11 $$
Thus, the values of the safety margin for k = 3 and 6 are 11.75 and 11, respectively.
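The safety margin calculation of Example 9.1 can be reproduced with the small Python sketch below; the function name and keyword arguments are illustrative, not from the original text.

```python
def safety_margin(mean_strength, std_strength, mean_stress, std_stress, k):
    """Equations (9.5) and (9.6): safety margin for normally distributed
    stress and strength; k typically takes a value between 3 and 6."""
    max_stress = mean_stress + k * std_stress           # Equation (9.6)
    return (mean_strength - max_stress) / std_strength  # Equation (9.5)

print(safety_margin(30000, 1200, 15000, 300, k=3))  # 11.75
print(safety_margin(30000, 1200, 15000, 300, k=6))  # 11.0
```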
$$ R = P(x > y) \qquad (9.7) $$
where
R is the item reliability.
P is the probability.
y is the stress random variable.
x is the strength random variable.
$$ R = \int_{-\infty}^{\infty} f(y)\left[\int_{y}^{\infty} f(x)\,dx\right] dy \qquad (9.8) $$
where
f(y) is the probability density function of the stress.
f(x) is the probability density function of the strength.
Equation (9.8) can also be written in different forms and three such forms are as
follows:
$$ R = \int_{-\infty}^{\infty} f(x)\left[\int_{-\infty}^{x} f(y)\,dy\right] dx \qquad (9.9) $$
$$ R = \int_{-\infty}^{\infty} f(y)\left[1 - \int_{-\infty}^{y} f(x)\,dx\right] dy \qquad (9.10) $$
$$ R = \int_{-\infty}^{\infty} f(x)\left[1 - \int_{x}^{\infty} f(y)\,dy\right] dx \qquad (9.11) $$
It is to be noted that Equations (9.10) and (9.11) were written using the following
two relationships in Equations (9.8) and (9.9), respectively:
$$ \int_{-\infty}^{y} f(x)\,dx + \int_{y}^{\infty} f(x)\,dx = 1 \qquad (9.12) $$
$$ \int_{-\infty}^{x} f(y)\,dy + \int_{x}^{\infty} f(y)\,dy = 1 \qquad (9.13) $$
Using the above reliability equations, the following stress-strength models were
developed for defined stress and strength probability density functions of an item:
MODEL I
In this case, stress and strength associated with an item are exponentially distributed.
Thus, we have
$$ f(y) = \alpha\,e^{-\alpha y}, \quad 0 \le y < \infty \qquad (9.14) $$
and
$$ f(x) = \beta\,e^{-\beta x}, \quad 0 \le x < \infty \qquad (9.15) $$
where
α and β are the reciprocals of the mean values of stress and strength, respectively.
Inserting Equations (9.14) and (9.15) into Equation (9.8) yields
$$ R = \int_0^{\infty} \alpha\,e^{-\alpha y}\,e^{-\beta y}\,dy = \int_0^{\infty} \alpha\,e^{-(\alpha + \beta)y}\,dy = \frac{\alpha}{\alpha + \beta} \qquad (9.16) $$
For α = 1/ȳ and β = 1/x̄, Equation (9.16) becomes
$$ R = \frac{\bar{x}}{\bar{x} + \bar{y}} \qquad (9.17) $$
where
x̄ and ȳ are the mean strength and stress, respectively.
Dividing the top and bottom of the right-hand side of Equation (9.17) by x̄, we get
$$ R = \frac{1}{1 + \bar{y}/\bar{x}} = \frac{1}{1 + \theta} \qquad (9.18) $$
where
θ = ȳ/x̄.
For reliable design x̄ ≥ ȳ; thus, θ ≤ 1. For given values of θ, using Equation (9.18),
we obtained the tabulated values for item reliability shown in Table 9.1.
The Table 9.1 values indicate that as the value of θ increases from 0 to 1, the item
reliability decreases from 1 to 0.5. More specifically, as the mean stress increases,
the item reliability decreases accordingly.
Example 9.2
Assume that stress and strength associated with an item are exponentially distributed
with mean values of 10,000 and 35,000 psi, respectively. Compute the item reliability.
Substituting the given data into Equation (9.17) yields
$$ R = \frac{35,000}{35,000 + 10,000} = 0.7778 $$
Thus, the item reliability is 0.7778.
TABLE 9.1
Values for Item Reliability

θ      Item reliability, R
0      1
0.1    0.9091
0.2    0.8333
0.3    0.7692
0.4    0.7143
0.5    0.6667
0.6    0.6250
0.7    0.5882
0.8    0.5556
0.9    0.5263
1      0.5
MODEL II
In this case, an items stress and strength are described by normal and exponential
probability density functions, respectively, as
$$ f(y) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right], \quad -\infty < y < \infty \qquad (9.19) $$
$$ f(x) = \beta\,e^{-\beta x}, \quad x \ge 0 \qquad (9.20) $$
where
μ is the mean stress.
σ is the standard deviation associated with the stress.
β is the reciprocal of the mean strength.
Substituting Equations (9.19) and (9.20) into Equation (9.8), we get
$$ R = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right]\left[\int_{y}^{\infty}\beta\,e^{-\beta x}\,dx\right] dy = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2 - \beta y\right] dy \qquad (9.21) $$
since
$$ \frac{(y - \mu)^2}{2\sigma^2} + \beta y = \frac{2\beta\mu\sigma^2 - \beta^2\sigma^4}{2\sigma^2} + \frac{\left[y - (\mu - \beta\sigma^2)\right]^2}{2\sigma^2} \qquad (9.22) $$
Equation (9.21) reduces to
$$ R = \exp\left[-\frac{1}{2}\left(2\beta\mu - \beta^2\sigma^2\right)\right] \qquad (9.23) $$
Example 9.3
Assume that the strength of an item follows an exponential distribution with a mean
of 35,000 psi. The stress acting upon the item is normally distributed with mean
10,000 psi and standard deviation 5,000 psi. Calculate the items reliability.
By inserting the given data into Equation (9.23), we get
$$ R = \exp\left\{-\frac{1}{2}\left[\frac{2(10,000)}{35,000} - \frac{(5,000)^2}{(35,000)^2}\right]\right\} = \exp\left[-\frac{1}{2}(0.5714 - 0.0204)\right] = 0.7592 $$
Thus, the items reliability is 0.7592.
MODEL III
In this case, the strength of the item is normally distributed and the stress acting
upon it follows the exponential distribution. Thus, we have
$$ f(x) = \frac{1}{\sigma_x\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x - \bar{x}}{\sigma_x}\right)^2\right], \quad -\infty < x < \infty \qquad (9.24) $$
$$ f(y) = \alpha\,e^{-\alpha y}, \quad y > 0 \qquad (9.25) $$
where
α is the reciprocal of the mean stress.
x̄ is the mean strength.
σ_x is the strength standard deviation.
$$ R = \int_{-\infty}^{\infty} \frac{1}{\sigma_x\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x - \bar{x}}{\sigma_x}\right)^2\right]\left[\int_0^{x}\alpha\,e^{-\alpha y}\,dy\right] dx = 1 - \int_{-\infty}^{\infty} \frac{1}{\sigma_x\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x - \bar{x}}{\sigma_x}\right)^2 - \alpha x\right] dx \qquad (9.26) $$
since
$$ \frac{(x - \bar{x})^2}{2\sigma_x^2} + \alpha x = \frac{2\alpha\bar{x}\sigma_x^2 - \alpha^2\sigma_x^4}{2\sigma_x^2} + \frac{\left[x - (\bar{x} - \alpha\sigma_x^2)\right]^2}{2\sigma_x^2} \qquad (9.27) $$
Equation (9.26) can be written as
$$ R = 1 - G\,\exp\left[-\frac{1}{2}\left(2\alpha\bar{x} - \alpha^2\sigma_x^2\right)\right] \qquad (9.28) $$
where
$$ G \equiv \int_{-\infty}^{\infty} \frac{1}{\sigma_x\sqrt{2\pi}}\exp\left\{-\frac{1}{2}\left[\frac{x - (\bar{x} - \alpha\sigma_x^2)}{\sigma_x}\right]^2\right\} dx = 1 \qquad (9.29) $$
Thus,
$$ R = 1 - \exp\left[-\frac{1}{2}\left(2\alpha\bar{x} - \alpha^2\sigma_x^2\right)\right] \qquad (9.30) $$
Example 9.4
An items stress and strength are described by exponential and normal distributions,
respectively. The mean of the stress is 10,000 psi and the mean and standard deviation
of the strength are 30,000 and 3,000 psi, respectively. Calculate the items reliability.
Inserting the specified data into Equation (9.30), we get
$$ R = 1 - \exp\left\{-\frac{1}{2}\left[\frac{2(30,000)}{10,000} - \frac{(3,000)^2}{(10,000)^2}\right]\right\} = 1 - \exp\left[-\frac{1}{2}(6 - 0.09)\right] = 0.9479 $$
Thus, the items reliability is 0.9479.
Other stress-strength models are presented in Reference 10.
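The three stress-strength models above reduce to one-line formulas that are easy to check numerically. The Python sketch below (function names are illustrative) reproduces the results of Examples 9.2 through 9.4.

```python
import math

def reliability_exp_stress_exp_strength(mean_stress, mean_strength):
    """Model I, Equation (9.17): exponential stress and strength."""
    return mean_strength / (mean_strength + mean_stress)

def reliability_normal_stress_exp_strength(mu, sigma, mean_strength):
    """Model II, Equation (9.23): normal stress, exponential strength."""
    beta = 1.0 / mean_strength
    return math.exp(-0.5 * (2 * beta * mu - (beta * sigma) ** 2))

def reliability_exp_stress_normal_strength(mean_stress, mean_x, sigma_x):
    """Model III, Equation (9.30): exponential stress, normal strength."""
    alpha = 1.0 / mean_stress
    return 1.0 - math.exp(-0.5 * (2 * alpha * mean_x - (alpha * sigma_x) ** 2))

print(round(reliability_exp_stress_exp_strength(10000, 35000), 4))           # 0.7778
print(round(reliability_normal_stress_exp_strength(10000, 5000, 35000), 4))  # ~0.7592
print(round(reliability_exp_stress_normal_strength(10000, 30000, 3000), 4))  # ~0.9479
```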
9.6
This is a useful method for estimating an items reliability when its associated stress
and strength distributions cannot be assumed, but there is an adequate amount of
empirical data. This approach can also be used when stress and strength distributions
associated with an item are known. Obviously, the method is based on Mellin
transforms and for Equation (9.8) they are defined as follows:
$$ X = \int_{y}^{\infty} f(x)\,dx = 1 - F_1(y) \qquad (9.31) $$
and
$$ Y = \int_{-\infty}^{y} f(y)\,dy = F_2(y) \qquad (9.32) $$
where
F1 (y) and F2 (y) are the cumulative distribution functions.
Taking the derivative of Equation (9.32) with respect to y, we have
$$ \frac{dY}{dy} = f(y) \qquad (9.33) $$
or
$$ dY = f(y)\,dy \qquad (9.34) $$
Obviously, it can be easily seen from Equation (9.32) that Y takes values of 0 to 1
(i.e., at y = 0, Y = 0 and at y = , Y = 1). Thus, inserting Equations (9.31) and (9.34)
into Equation (9.8) yields
$$ R = \int_0^{1} X\,dY \qquad (9.35) $$
The above equation indicates that the area under the X vs. Y plot represents item
reliability. This area can be estimated by using the Simpsons rule, expressed below.
$$ \int_j^k f(z)\,dz \simeq \frac{k - j}{3n}\left[W_0 + 4W_1 + 2W_2 + 4W_3 + \cdots + 2W_{n-2} + 4W_{n-1} + W_n\right] \qquad (9.36) $$
where
f(z) is a function of z defined over the interval (j, k), i.e., j ≤ z ≤ k.
n is the even number of equal subdivided parts of the interval (j, k), each of width (k − j)/n.
W_i is the value of f(z) at the ith division point, for i = 0, 1, 2, …, n.
Inserting Equation (9.14) and the given data of Example 9.2 into Equation (9.32), we get
$$ Y = \int_0^{y} \frac{1}{10,000}\,e^{-y/10,000}\,dy = 1 - e^{-y/10,000} \qquad (9.37) $$
Inserting Equation (9.15) and the given relevant data of Example 9.2 into
Equation (9.31), we have
$$ X = 1 - \left[1 - e^{-y/35,000}\right] = e^{-y/35,000} \qquad (9.38) $$
Table 9.2 presents values of Y and X using Equations (9.37) and (9.38), respectively,
for assumed values of stress y.
TABLE 9.2
Computed Values of Y and X for the Assumed Values of Stress y

y (psi)    Y       X
0          0       1
4,000      0.33    0.89
8,000      0.55    0.80
12,000     0.70    0.71
16,000     0.80    0.63
20,000     0.87    0.57
24,000     0.91    0.50
28,000     0.94    0.45
32,000     0.96    0.40
36,000     0.97    0.36
40,000     0.98    0.32
44,000     0.99    0.29
48,000     0.992   0.25
∞          1       0
Figure 9.3 shows the plot of X vs. Y for data given in Table 9.2. The area under
the Figure 9.3 curve is estimated using Equation (9.36) as follows:
$$ R \simeq \frac{(1 - 0)}{3(4)}\left[W_0 + 4W_1 + 2W_2 + 4W_3 + W_4\right] \simeq \frac{1}{12}\left[1 + 4(0.95) + 2(0.84) + 4(0.68) + 0\right] \simeq 0.7667 $$
The above result, i.e., R = 0.7667, is quite close to the one obtained using the
analytical approach in Example 9.2, i.e., R = 0.7778.
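The area under the X versus Y curve can also be obtained directly by numerical integration. The Python sketch below is illustrative only: for the exponential stress and strength of Example 9.2, X can be written in closed form as a function of Y (an algebraic convenience assumed here, not a step taken in the text), and Simpson's rule (Equation 9.36) is then applied.

```python
mean_stress, mean_strength = 10000.0, 35000.0

def x_value(big_y):
    """X as a function of Y for the Example 9.2 data: X = (1 - Y)**(mean_stress/mean_strength)."""
    return (1.0 - big_y) ** (mean_stress / mean_strength)

def simpson_area(n=4):
    """Approximate R = integral of X dY over [0, 1] using Simpson's rule."""
    h = 1.0 / n
    total = x_value(0.0) + x_value(1.0)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * x_value(i * h)
    return total * h / 3.0

print(round(simpson_area(4), 4))    # ~0.7514 with only 4 subdivisions
print(round(simpson_area(200), 4))  # approaches the analytical value 0.7778
```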
9.7.1 BRAKE SYSTEM
The brake system failure rate is expressed as the sum of the failure rates of its parts:
$$ \lambda_{bs} = \lambda_{bfm} + \lambda_{s} + \lambda_{b} + \lambda_{se} + \lambda_{a} + \lambda_{h} \qquad (9.39) $$
where
λ_bs is the brake system failure rate, expressed in failures/10^6 h.
λ_bfm is the failure rate of the brake friction material.
λ_s, λ_b, λ_se, λ_a, and λ_h are the failure rates of the remaining brake system components.
The values of λ_bs, λ_bfm, λ_s, λ_b, λ_se, λ_a, and λ_h are obtained through various
means [25, 28]. For example, λ_bfm can be calculated by using the following
equation [25]:
$$ \lambda_{bfm} = \lambda_{bbfm} \prod_{i=1}^{4} f_i \qquad (9.40) $$
where
λ_bbfm is the base failure rate associated with the brake friction material.
f_i is the ith multiplying factor that considers the effects on the base failure
rate of items such as dust contaminants (i = 1), ambient temperature
(i = 2), brake type (i = 3), and counter-surface roughness (i = 4).
In turn, the base failure rate of the brake friction material, λ_bbfm, is given in
Reference 29 as a function of the parameters R_W, W, n, A_C, L_T, and A.  (9.41)
9.7.2 CLUTCH SYSTEM
The clutch system failure rate is expressed by
$$ \lambda_{cs} = \lambda_{cfm} + \lambda_{s} + \lambda_{b} + \lambda_{se} + \lambda_{a} + \lambda_{h} \qquad (9.42) $$
where
λ_cs is the clutch system failure rate, expressed in failures/10^6 h.
λ_cfm is the failure rate of the clutch friction material.
Other symbols used in Equation (9.42) are the same as the ones used in Equation (9.39).
The failure rate of clutch friction materials, cfm, is expressed by
$$ \lambda_{cfm} = \lambda_{bcfm}\,f_1\,f_2 \qquad (9.43) $$
where
λ_bcfm is the base failure rate of the clutch friction material.
f_1 is the factor that considers the effects on the base failure rate of multiple plates.
f_2 is the factor that considers the effects on the base failure rate of ambient temperature.
The clutch friction material base failure rate, λ_bcfm, is given in References 25 and 30
as a function of the parameters k, E_a, A_Wm, CFM_a, and L_T.  (9.44)
9.7.3 PUMP
The pump failure rate is expressed by
$$ \lambda_{p} = \lambda_{ps} + \lambda_{pse} + \lambda_{pb} + \lambda_{pc} + \lambda_{pfd} \qquad (9.45) $$
where
λ_p is the pump failure rate.
λ_ps is the pump shaft failure rate.
λ_pse is the pump seal failure rate.
λ_pb, λ_pc, and λ_pfd are the failure rates of the remaining pump components.
The failure rate of the pump shaft, λ_ps, can be calculated by using the following
equation:
$$ \lambda_{ps} = \lambda_{bps} \prod_{i=1}^{6} f_i \qquad (9.46) $$
where
bps is the base failure rate of the pump shaft.
fi is the ith modifying factor; for i = 1 is associated with material temperature, i = 2 pump displacement, i = 3 casing thrust load, i = 4 shaft
surface finish, i = 5 contamination, and i = 6 material endurance limit.
The pump seal failure rate, λ_pse, is expressed by
$$ \lambda_{pse} = \lambda_{bpse} \prod_{i=1}^{7} f_i \qquad (9.47) $$
where
bpse is the pump seal base failure rate.
fi
is the ith modifying factor; for i = 1 is for effects of casing thrust load,
i = 2 surface finish, i = 3 seal smoothness, i = 4 fluid viscosity, i = 5
pressure/velocity factor, i = 6 temperature, and i = 7 contaminates.
Similarly, the values of pb, pc, and pfd can be calculated. Reference 26 presents
procedures to calculate these failure rates.
9.7.4 FILTER
The following equation can be used to predict filter failure rate [26]:
$$ \lambda_{f} = \lambda_{bf} \prod_{i=1}^{6} f_i \qquad (9.48) $$
where
f is the filter failure rate, expressed in failures/106 h.
bf is the filter base failure rate.
fi is the ith modifying factor; for i = 1 is for temperature effects, i = 2
vibration effects, i = 3 water contamination effects, i = 4 cyclic flow
effects, i = 5 cold start effects, and i = 6 differential pressure effects.
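The part failure rate models in this section share a common structure: a base failure rate multiplied by a set of modifying factors. The generic Python sketch below illustrates this pattern; the base rate and factor values used are purely illustrative assumptions, not handbook data.

```python
def part_failure_rate(base_rate, modifying_factors):
    """Generic form of Equations (9.46) through (9.48), (9.50), and (9.51):
    part failure rate = base failure rate * product of modifying factors."""
    rate = base_rate
    for f in modifying_factors:
        rate *= f
    return rate

# Illustrative filter example (Equation 9.48): assumed base rate in failures/10^6 h
# and six assumed factors (temperature, vibration, water contamination,
# cyclic flow, cold start, differential pressure).
print(part_failure_rate(3.2, [1.1, 1.0, 1.3, 1.0, 1.2, 1.1]))
```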
9.7.5 COMPRESSOR SYSTEM
The compressor system failure rate is expressed by the following equation [27]:
$$ \lambda_{com} = \lambda_{cs} + \lambda_{cse} + \lambda_{cva} + \lambda_{cdc} + \lambda_{cc} + \lambda_{cb} \qquad (9.49) $$
where
λ_com is the compressor system failure rate.
λ_cs is the compressor shaft failure rate.
λ_cse is the compressor seal failure rate.
λ_cva is the compressor valve assembly failure rate.
λ_cdc, λ_cc, and λ_cb are the failure rates of the remaining compressor components.
The compressor shaft failure rate can be calculated from
$$ \lambda_{cs} = \lambda_{bcs} \prod_{i=1}^{5} f_i \qquad (9.50) $$
where
bcs is the compressor shaft base failure rate.
fi is the ith modifying factor; for i = 1 is for material temperature, i = 2
material endurance limit, i = 3 displacement, i = 4 shaft surface finish,
and i = 5 contamination.
The compressor seal failure rate can be predicted by using the following equation:
$$ \lambda_{cse} = \lambda_{bcse} \prod_{i=1}^{10} f_i \qquad (9.51) $$
where
bcse is the compressor seal base failure rate.
fi
is the ith modifying factor; for i = 1 is for fluid pressure, i = 2 seal face
pressure and gas velocity, i = 3 seal smoothness, i = 4 seal size, i = 5
temperature, i = 6 allowable leakage, i = 7 contaminants, i = 8 flow rate,
i = 9 fluid viscosity, and i = 10 surface finish and other conductance
parameters.
The compressor valve assembly failure rate is expressed by
$$ \lambda_{cva} = \lambda_{so} + \lambda_{sa} + \lambda_{pa} \qquad (9.52) $$
where
λ_so is the failure rate of the solenoid (if any).
λ_sa is the failure rate of the sliding action valve assembly (if any).
λ_pa is the failure rate of the poppet assembly (if any).
Procedures for calculating cdc, cc, and cb are presented in Reference 27.
9.7.6 BEARING
The bearing failure rate under actual use conditions is obtained by adjusting a base
failure rate with multiplying factors, in the same general manner as Equations (9.46)
through (9.51).  (9.53)
where
λ_br is the failure rate of the bearing (under actual conditions), expressed in failures/10^6 h of operation.
x is a factor; x = 4.0 for ball bearings and x = 3.33 for roller bearings.
L_c, A_E, S_V, O_V, A_L, S_L, and λ_pbs are the remaining parameters of the bearing failure rate relationship.
The bearing base failure rate is expressed in terms of the number of bearings in the
system and their rated life.  (9.54)
where
M is the number of bearings in the system.
RL is the rated life in revolutions and is given by
$$ RL = \frac{16,700}{rpm}\left(\frac{BLR}{ERL}\right)^{x} \qquad (9.55) $$
where
rpm is the shaft rotational speed, in revolutions per minute.
BLR is the basic load rating of the bearing.
ERL is the equivalent radial load on the bearing.
Table 9.3 presents failure rates for selected mechanical items. Chapter 4 discusses
the subject of failure data in more depth.
TABLE 9.3
Failure Rates for Selected Mechanical Parts [10, 12]

No.  Mechanical part            Failure rate (failures/10^6 h)
1    Hose, pneumatic            29.3
2    Pressure regulator         2.4
3    Fuel pump                  176.4
4    Compressor (general)       33.6
5    Clutch (friction)          38.2
6    Filter, gas (air)          3.2
7    Heavy-duty ball bearing    14.4
8    Seal, O-ring               0.2
9    Brake assembly             16.8
10   Guide pin                  13
9.9.1 EQUIPMENT REPLACEMENT MODEL
This mathematical model is concerned with determining the optimum time interval
between replacements by minimizing the mean annual cost associated with the
equipment with respect to the time between replacements or the equipment life
expressed in years. The model assumes that the equipment mean annual cost is made
up of three components: (1) average operating cost, (2) average maintenance cost,
and (3) average investment cost. Thus, the mean annual cost of the equipment is
expressed by
$$ MC = C_0 + C_m + \frac{C_I}{x} + \frac{(x - 1)}{2}\left(IOC + IMC\right) \qquad (9.56) $$
where
MC is the mean annual cost of the equipment.
C_0 is the operating cost for the first year.
C_m is the maintenance cost for the first year.
C_I is the investment cost of the equipment.
x is the time between replacements (i.e., the equipment life), expressed in years.
IOC is the amount by which the operating cost increases per year.
IMC is the amount by which the maintenance cost increases per year.
Differentiating Equation (9.56) with respect to x and equating the result to zero, we get
$$ \frac{dMC}{dx} = -\frac{C_I}{x^2} + \frac{IOC + IMC}{2} = 0 \qquad (9.57) $$
Solving Equation (9.57) for x yields
$$ x^* = \left[\frac{2\,C_I}{IOC + IMC}\right]^{1/2} \qquad (9.58) $$
where
x* is the optimum replacement time interval.
Example 9.6
Assume that the following data are specified for mechanical equipment:
IMC = $500, IOC = $600, and CI = $15, 000
Calculate the optimum time for the mechanical equipment replacement.
Using the above data in Equation (9.58) yields
$$ x^* = \left[\frac{2(15,000)}{600 + 500}\right]^{1/2} = 5.2 \text{ years} $$
Thus, the optimum time for the mechanical equipment replacement is 5.2 years.
9.9.2 OPTIMUM NUMBER OF INSPECTIONS MODEL
This mathematical model can be used to calculate the optimum number of inspections per facility per unit time. This information is quite useful to decision makers
because inspections are often disruptive but on the other hand such inspections
normally reduce the equipment downtime because of fewer breakdowns. In this
model, expression for the total equipment downtime is used to obtain the optimum
number of inspections.
The total facility downtime per unit time is given by [35]:
$$ TDT = n\,(DTI) + \frac{k\,(DTB)}{n} \qquad (9.59) $$
where
TDT is the total facility downtime per unit time.
k is a constant associated with a particular facility.
n is the number of inspections per facility per unit time.
DTI is the facility downtime per inspection.
DTB is the facility downtime per breakdown.
Differentiating Equation (9.59) with respect to n and equating the result to zero, we get
$$ \frac{d(TDT)}{dn} = DTI - \frac{k\,(DTB)}{n^2} = 0 \qquad (9.60) $$
Thus,
$$ n^* = \left[\frac{k\,(DTB)}{DTI}\right]^{1/2} \qquad (9.61) $$
where
n* is the optimum number of inspections per facility per unit time.
Inserting Equation (9.61) into Equation (9.59) yields
$$ TDT^* = 2\left[(DTB)(DTI)\,k\right]^{1/2} \qquad (9.62) $$
where
TDT* is the minimum total facility downtime per unit time.
Example 9.7
The following data are associated with certain mechanical equipment:
DTI = 0.02 month, DTB = 0.1 month, and k = 4.  (9.63)
Calculate the optimum number of inspections per month and the minimum value of
the total equipment downtime by using Equations (9.61) and (9.62), respectively.
Thus, inserting the given data into Equation (9.61) yields
$$ n^* = \left[\frac{4(0.1)}{0.02}\right]^{1/2} = 4.5 \text{ inspections per month} $$
Similarly, substituting the given data into Equation (9.62) yields
$$ TDT^* = 2\left[(0.1)(0.02)(4)\right]^{1/2} = 0.18 \text{ month} $$
Thus, the optimum number of inspections per month is 4.5 and the minimum
value of the total equipment downtime is 0.18 month.
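Both optimization results of this section reduce to simple closed-form expressions. The Python sketch below (function and argument names are illustrative) reproduces Examples 9.6 and 9.7.

```python
import math

def optimum_replacement_interval(investment_cost, ioc, imc):
    """Equation (9.58): optimum time between equipment replacements, in years."""
    return math.sqrt(2.0 * investment_cost / (ioc + imc))

def optimum_inspections(k, dtb, dti):
    """Equations (9.61) and (9.62): optimum inspections per unit time and the
    corresponding minimum total downtime per unit time."""
    n_star = math.sqrt(k * dtb / dti)
    tdt_star = 2.0 * math.sqrt(dtb * dti * k)
    return n_star, tdt_star

print(round(optimum_replacement_interval(15000, 600, 500), 1))  # 5.2 years
n_star, tdt_star = optimum_inspections(k=4, dtb=0.1, dti=0.02)
print(round(n_star, 1), round(tdt_star, 2))  # 4.5 inspections/month, 0.18 month
```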
9.10
PROBLEMS
9.11
REFERENCES
29. Minegishi, H., Prediction of Brake Pad Wear/Life by Means of Brake Severity Factor
as Measured on a Data Logging System, Society of Automotive Engineers (SAE)
Paper No. 840358, 1984, SAE, Warrendale, PA.
30. Spokas, R.B., Clutch Friction Material Evaluation Procedures, SAE Paper No. 841066,
1984, SAE, Warrendale, PA.
31. Shafer, R.E., Angus, J.E., Finkelstein, J.M., Yerasi, M., and Fulton, D.W., RADC
Non-Electronic Reliability Notebook, Report No. RADC-TR-85-194, 1985, Reliability Analysis Center (RAC), Rome Air Development Center (RADC), Griffiss Air
Force Base, Rome, NY.
32. GIDEP Operations Center, U.S. Department of Navy, Naval Weapons Station, Seal
Beach, Corona Annex, Corona, CA.
33. Earles, D.R., Handbook: Failure Rates, AVCO Corporation, Massachussetts, 1962.
34. Guth, G., Development of Nonelectronic Part Cycle Failure Rates, Rept. No. 17D/A050678, December 1977, Martin Marietta Corporation, Orlando, FL. Available from
the National Technical Information Service (NTIS), Springfield, VA.
10
Human Reliability in
Engineering Systems
10.1 INTRODUCTION
The failure of engineering systems is not only caused by hardware or software
malfunctions, but also by human errors. In fact, the reliability of humans plays a
crucial role in the entire system life cycle: design and development phase, manufacturing phase, and operation phase.
Even though in modern times the history of human factors may be traced back
to 1898 when Frederick W. Taylor performed studies to determine the most appropriate designs for shovels [1], it was not until 1958 when Williams [2] recognized
that human-element reliability must be included in the overall system-reliability
prediction; otherwise, such a prediction would not be realistic. In 1960, the work of
Shapero et al. [3] further stressed the importance of human reliability in engineering
systems by pointing out that human error is the cause for 20 to 50% of all equipment
failures.
In 1962, a database known as Data Store containing time and human performance
reliability estimates for human engineering design features was established [4]. Also,
in the 1960s two symposia concerning human reliability/error were held [5, 6]. In
1973, IEEE Transactions on Reliability published a special issue on human
reliability [7]. In 1980, a selective bibliography on human reliability was published
covering the period from 1958 to 1978 [8]. The first book on human reliability
entitled Human Reliability: With Human Factors appeared in 1986 [9].
Over the years many professionals have contributed to the field of human reliability. A comprehensive list of publications on the subject is given in Reference 10.
This chapter presents different aspects of human reliability.
Human errors may be broken into many distinct types as follows [9, 21, 22]:
Design errors. These types of errors are the result of inadequate design.
The causes of these errors are assigning inappropriate functions to
humans, failure to implement human needs in the design, failure to ensure
the man-machine interaction effectiveness, and so on. An example of
design errors is the placement of controls and displays so far apart that
an operator is unable to use them in an effective manner.
Operator errors. These errors are the result of operator mistakes and the
conditions that lead to operator errors include lack of proper procedures,
complex tasks, poor personnel selection and training, poor environment,
and operator carelessness.
Assembly errors. These errors occur during product assembly due to
humans. Assembly errors may occur due to causes such as poorly designed
work layout, inadequate illumination, excessive noise level, poor blueprints and other related material, excessive temperature in the work area,
and poor communication of related information.
Inspection errors. These errors occur because of less than 100% accuracy
of inspectors. One example of an inspection error is accepting and rejecting out-of-tolerance and in-tolerance parts, respectively. Nonetheless,
according to Reference 23, an average inspection effectiveness is close to
85%.
Maintenance errors. These errors occur in the field due to oversights by
the maintenance personnel. As the equipment becomes old, the likelihood
of the occurrence of those errors may increase because of the increase in
maintenance frequency. Some of the examples of maintenance errors are
calibrating equipment incorrectly, applying the wrong grease at appropriate points of the equipment, and repairing the failed equipment incorrectly.
Installation errors. These errors occur due to various reasons including
using the wrong installation related blueprints or instructions, or simply
failing to install equipment according to the manufacturers specification.
Handling errors. These errors basically occur because of inadequate
storage or transportation facilities. More specifically, such facilities are
not as specified by the equipment manufacturer.
In general, there could be numerous causes for the occurrence of human error. Some
of those are listed below [9, 21].
FIGURE 10.2 Some causes for the occurrence of human error.
There are many factors that increase the stress on a human and, in turn, decrease
his/her reliability in work and other environments; a number of such factors are
identified in Reference 25.
A human operator has certain limitations in conducting certain tasks. The error
occurrence probability increases when such limitations are exceeded. In order to
improve operator reliability, such operator limitations or stress characteristics must
be considered carefully during the design phase.
The operator stress characteristics include [12]: very short decision-making time,
several displays difficult to discriminate, requirement to perform steps at high speed,
poor feedback for the determination of accuracy of actions taken, requirement for
prolonged monitoring, very long sequence of steps required to perform a task, requirement to make decisions on the basis of data obtained from various different sources,
and requirement to operate at high speed more than one control simultaneously.
$$ R_h(t + \Delta t) = P(X_1 X_2) = P(X_1)\,P(X_2 \mid X_1) \qquad (10.1) $$
where
X_1 is the event that no human error occurs during the interval [0, t].
X_2 represents the event that human error will not occur during the interval [t, t + Δt].
R_h(t + Δt) is the human reliability at time t + Δt.
Since R_h(t) = P(X_1), we can write
$$ R_h(t + \Delta t) - R_h(t) = P(X_1)\,P(X_2 \mid X_1) - P(X_1) = -P(X_1)\left[1 - P(X_2 \mid X_1)\right] \qquad (10.2) $$
For a small interval, the probability of a human error occurring in [t, t + Δt], given
error-free performance up to time t, can be written as
$$ 1 - P(X_2 \mid X_1) = \lambda_h(t)\,\Delta t \qquad (10.3) $$
where λ_h(t) is the time-dependent human error rate. Thus,
$$ \frac{R_h(t + \Delta t) - R_h(t)}{\Delta t} = -\lambda_h(t)\,R_h(t) \qquad (10.4) $$
In the limit as Δt → 0, Equation (10.4) becomes
$$ \frac{dR_h(t)}{dt} = -\lambda_h(t)\,R_h(t) \qquad (10.5) $$
or
$$ \lambda_h(t) = -\frac{1}{R_h(t)}\,\frac{dR_h(t)}{dt} \qquad (10.6) $$
Integrating both sides of Equation (10.6) over the time interval [0, t] results in
$$ \int_0^t \lambda_h(t)\,dt = -\int_1^{R_h(t)} \frac{1}{R_h(t)}\,dR_h(t) \qquad (10.7) $$
since at time t = 0, R_h(0) = 1. Evaluating the right-hand side of Equation (10.7) yields
$$ \int_0^t \lambda_h(t)\,dt = -\ln R_h(t) \qquad (10.8) $$
or
$$ \ln R_h(t) = -\int_0^t \lambda_h(t)\,dt \qquad (10.9) $$
Thus,
$$ R_h(t) = e^{-\int_0^t \lambda_h(t)\,dt} \qquad (10.10) $$
The above equation is the general human performance reliability function. It can be
used to predict human reliability at time t when times to human error are described
by any known statistical distribution. A study reported in Reference 27 collected
human error data for time-continuous tasks under laboratory conditions. The Weibull,
gamma, and log-normal distributions fitted quite well to these data.
In order to obtain a general expression for mean time to human error (MTTHE),
we integrate Equation (10.10) over the interval [0, ∞]:
$$ MTTHE = \int_0^{\infty} R_h(t)\,dt = \int_0^{\infty} \exp\left[-\int_0^t \lambda_h(t)\,dt\right] dt \qquad (10.11) $$
The above equation can be used to obtain MTTHE when times to human error are
governed by any known distribution function.
EXAMPLE 10.1
Assume that the time to human error associated with a time-continuous task is
exponentially distributed and the human error rate is 0.02 errors per hour. Calculate
the following:
Human performance reliability for a 5-h mission.
MTTHE.
Thus, in this case we have
λ_h(t) = λ_h = 0.02 errors/h
Inserting the above value into Equation (10.10) yields
$$ R_h(t) = e^{-\int_0^t (0.02)\,dt} = e^{-(0.02)t} \qquad (10.12) $$
For a 5-h mission, R_h(5) = e^{-(0.02)(5)} = 0.9048.
Similarly, substituting the given value into Equation (10.11) yields
$$ MTTHE = \int_0^{\infty} e^{-(0.02)t}\,dt = \frac{1}{(0.02)} = 50 \text{ h} \qquad (10.13) $$
It means the human performance reliability is 0.9048 and a human error can
occur after every 50 h.
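For the constant error rate case, the two results of Example 10.1 can be verified with the small Python sketch below (illustrative names, not from the original text).

```python
import math

def human_reliability(t, error_rate):
    """Equation (10.12): human performance reliability for a constant error rate."""
    return math.exp(-error_rate * t)

def mtthe(error_rate):
    """Equation (10.13): mean time to human error for a constant error rate."""
    return 1.0 / error_rate

print(round(human_reliability(5, 0.02), 4))  # 0.9048
print(mtthe(0.02))                           # 50.0 h
```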
EXAMPLE 10.2
A person is performing a time-continuous task and his/her time to human error is
described by Weibull distribution. Obtain expressions for human reliability and
MTTHE.
In this case, as the time to human error is Weibull distributed, the human error
rate at time t is given by
$$ \lambda_h(t) = \frac{\beta}{\theta}\left(\frac{t}{\theta}\right)^{\beta - 1} \qquad (10.14) $$
where
β is the shape parameter.
θ is the scale parameter.
By inserting Equation (10.14) into Equation (10.10) we have
$$ R_h(t) = \exp\left[-\int_0^t \frac{\beta}{\theta}\left(\frac{t}{\theta}\right)^{\beta - 1} dt\right] = e^{-(t/\theta)^{\beta}} \qquad (10.15) $$
Similarly, using Equation (10.15) in Equation (10.11) yields
$$ MTTHE = \int_0^{\infty} e^{-(t/\theta)^{\beta}}\,dt = \theta\,\Gamma\!\left(\frac{1}{\beta} + 1\right) \qquad (10.16) $$
where
Γ(·) is the gamma function.
The human reliability and MTTHE are given by Equations (10.15) and (10.16),
respectively.
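The Weibull expressions of Example 10.2 can be evaluated with the short Python sketch below; the scale and shape values used are illustrative assumptions, not data from the text.

```python
import math

def weibull_human_reliability(t, scale, shape):
    """Equation (10.15): R_h(t) = exp[-(t/theta)**beta]."""
    return math.exp(-((t / scale) ** shape))

def weibull_mtthe(scale, shape):
    """Equation (10.16): MTTHE = theta * Gamma(1/beta + 1)."""
    return scale * math.gamma(1.0 / shape + 1.0)

# Illustrative values: scale (theta) = 100 h, shape (beta) = 1.5
print(round(weibull_human_reliability(50, 100, 1.5), 4))  # ~0.70
print(round(weibull_mtthe(100, 1.5), 2))                  # ~90.27 h
```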
10.7
Over the years, many techniques to evaluate human reliability have been developed [9].
Each such technique has advantages and disadvantages. Some of these methods are
presented below.
EXAMPLE 10.3
A person performs two quality control tasks: α and β. Each of these two tasks can
either be performed correctly or incorrectly. Task α is performed before task β,
and both tasks are independent of each other. In other words, the performance
of task α does not affect the performance of task β, or vice versa. Develop a
probability tree for this example and obtain an expression for the probability of not
successfully completing the overall mission.
In this case, the person first performs task α correctly or incorrectly and then
proceeds to task β, which can also be performed correctly or incorrectly. This scenario
is depicted by the probability tree shown in Figure 10.3. In this figure, α and β with
bars denote unsuccessful events, and without bars denote successful events. Other
symbols used in the solution to the example are as follows:
P_s is the probability of success of the overall mission (i.e., performing both tasks α and β correctly).
P_f is the failure probability of the overall mission.
P_α is the probability of performing task α correctly.
P_ᾱ is the probability of performing task α incorrectly.
P_β is the probability of performing task β correctly.
P_β̄ is the probability of performing task β incorrectly.
Using Figure 10.3, the probability of success of the overall mission is
$$ P_s = P_{\alpha}\,P_{\beta} \qquad (10.17) $$
Similarly, using Figure 10.3, the failure probability of the overall mission is given by
$$ P_f = P_{\alpha}\,P_{\bar{\beta}} + P_{\bar{\alpha}}\,P_{\beta} + P_{\bar{\alpha}}\,P_{\bar{\beta}} \qquad (10.18) $$
Since P_ᾱ = 1 − P_α and P_β̄ = 1 − P_β, we can also write
$$ P_s = 1 - P_f = 1 - \left[P_{\alpha}(1 - P_{\beta}) + (1 - P_{\alpha})P_{\beta} + (1 - P_{\alpha})(1 - P_{\beta})\right] \qquad (10.19) $$
$$ P_s = 1 - P_{\bar{\alpha}} - P_{\bar{\beta}} + P_{\bar{\alpha}}\,P_{\bar{\beta}} \qquad (10.20) $$
The above equation is exactly the same as Equation (10.17), if we write the right-hand
term of Equation (10.17) in terms of failure probabilities. Nonetheless, the probability
of not successfully completing the overall mission is given by Equation (10.18).
Example 10.4
Assume that in Example 10.3 the probabilities of performing tasks α and β incorrectly are 0.15 and 0.2, respectively. Determine the following:
Probability of successfully completing the overall mission.
Probability of not successfully completing the overall mission.
Substituting the given data into Equations (10.17) and (10.18), we get
$$ P_s = (1 - 0.15)(1 - 0.2) = 0.68 $$
$$ P_f = (0.15)(0.8) + (0.85)(0.2) + (0.15)(0.2) = 0.32 $$
The sum of the above two results is equal to unity.
Example 10.5
Assume that a person is to perform job J, which is composed of two tasks, X and Y;
both tasks must be performed correctly for the successful completion
of the job. Task X is made up of two subtasks X1 and X2. If any one of these two
subtasks is performed correctly, Task X can be completed successfully. Task Y is
composed of subtasks Y1, Y2, and Y3. All of these three subtasks must be performed
correctly for the success of Task Y. Both subtasks, X1 and Y1, are composed of two
steps each, i.e., x1, x2 and y1, y2, respectively. Both steps for each of these two
subtasks must be completed correctly for subtask success. Develop a fault tree for
the event that job J will be performed incorrectly by the person.
Figure 10.4 presents the fault tree for the example.
Example 10.6
Assume that the probability of occurrence of the basic events in Figure 10.4 is 0.04.
Calculate the probability of occurrence of the top event, Job J will be performed
incorrectly by the person. Assume in your calculations that the given fault tree is
independent.
The probability of performing subtask X1 incorrectly is given by
$$ P(X_1) = P(x_1) + P(x_2) - P(x_1)\,P(x_2) = 0.04 + 0.04 - (0.04)(0.04) = 0.0784 $$
where
P(x_i) is the probability of performing step x_i incorrectly, for i = 1, 2.
FIGURE 10.5 Fault tree with given and calculated probability values.
FIGURE 10.6 State space diagram for a human performing a time continuous task.
The numerals in the circles of Figure 10.6 denote system states. The following
symbols are associated with Figure 10.6:
P0 (t) is the probability that the human is performing his/her assigned task
normally at time t.
P1 (t) is the probability that the human has committed error at time t.
h
is the constant human error rate.
Using the Markov approach, we write down the following equations for Figure 10.6.
$$ P_0(t + \Delta t) = P_0(t)\left(1 - \lambda_h\,\Delta t\right) \qquad (10.21) $$
$$ P_1(t + \Delta t) = P_0(t)\left(\lambda_h\,\Delta t\right) + P_1(t) \qquad (10.22) $$
where
P_0(t + Δt) is the probability that the human is performing his/her assigned
task normally at time t + Δt.
P_1(t + Δt) is the probability that the human has committed an error by time t + Δt.
(1 − λ_hΔt) is the probability of the occurrence of no human error in the finite
time interval Δt.
Rearranging Equations (10.21) and (10.22) and taking the limit as Δt → 0, we get
$$ \frac{dP_0(t)}{dt} = -\lambda_h\,P_0(t) \qquad (10.23) $$
$$ \frac{dP_1(t)}{dt} = \lambda_h\,P_0(t) \qquad (10.24) $$
At time t = 0, P_0(0) = 1 and P_1(0) = 0. Solving Equations (10.23) and (10.24) yields
$$ P_0(t) = e^{-\lambda_h t} \qquad (10.25) $$
$$ P_1(t) = 1 - e^{-\lambda_h t} \qquad (10.26) $$
The human reliability is given by
$$ R_h(t) = P_0(t) = e^{-\lambda_h t} \qquad (10.27) $$
By integrating Equation (10.27) over the time interval [0, ∞], we get the following
equation for MTTHE:
$$ MTTHE = \int_0^{\infty} R_h(t)\,dt = \int_0^{\infty} e^{-\lambda_h t}\,dt = \frac{1}{\lambda_h} \qquad (10.28) $$
Example 10.7
Assume that a person is performing a time continuous task and his/her human error
rate is 0.006 errors per hour. Calculate the probability that the person will commit
an error during an 8-h mission.
Substituting the given data into Equation (10.26) yields
$$ P_1(8) = 1 - e^{-(0.006)(8)} = 0.0469 $$
There is an approximately 5% chance that the person will commit an error during
the 8-h mission.
10.8.1 RELIABILITY ANALYSIS OF A SYSTEM WITH HUMAN ERROR
This mathematical model represents a system which can fail either due to a hardware
failure (i.e., a failure other than a human error) or to a human error. The system
human/non-human failure rates are constant. The system state space diagram is
shown in Figure 10.7. By using the Markov approach we can obtain equations for
system reliability, system failure probability due to human error, system failure
probability due to non-human error, and system mean time to failure. The numerals
in the boxes of Figure 10.7 denote system state.
The following symbols are associated with this model:
λ_h is the constant human error rate.
λ_nh is the constant non-human error rate (i.e., hardware failure rate, etc.).
Pi (t) is the probability that the system is in state i at time t; for i = 0 (means
system operating normally), i = 1 (means system failed due to nonhuman error), and i = 2 (means system failed due to human error).
By applying the Markov method and using Figure 10.7, we get the following equations:
$$ \frac{dP_0(t)}{dt} + (\lambda_h + \lambda_{nh})\,P_0(t) = 0 \qquad (10.29) $$
$$ \frac{dP_1(t)}{dt} - \lambda_{nh}\,P_0(t) = 0 \qquad (10.30) $$
$$ \frac{dP_2(t)}{dt} - \lambda_h\,P_0(t) = 0 \qquad (10.31) $$
At time t = 0, P_0(0) = 1 and P_1(0) = P_2(0) = 0. Solving Equations (10.29) through
(10.31), we get
$$ P_0(t) = e^{-(\lambda_h + \lambda_{nh})t} \qquad (10.32) $$
$$ P_1(t) = \frac{\lambda_{nh}}{\lambda_{nh} + \lambda_h}\left[1 - e^{-(\lambda_h + \lambda_{nh})t}\right] \qquad (10.33) $$
$$ P_2(t) = \frac{\lambda_h}{\lambda_{nh} + \lambda_h}\left[1 - e^{-(\lambda_h + \lambda_{nh})t}\right] \qquad (10.34) $$
The system reliability is given by
$$ R_s(t) = P_0(t) = e^{-(\lambda_h + \lambda_{nh})t} \qquad (10.35) $$
The system mean time to failure is
$$ MTTF = \int_0^{\infty} R_s(t)\,dt = \int_0^{\infty} e^{-(\lambda_h + \lambda_{nh})t}\,dt = \frac{1}{\lambda_h + \lambda_{nh}} \qquad (10.36) $$
Example 10.7
A system can fail either due to a hardware failure or a human error and its hardware
failure and human error rates are 0.0005 failures/h and 0.0001 errors/h, respectively.
Calculate the system MTTF and failure probability due to human error for a 12-h
mission.
Inserting the given data into Equation (10.36) yields
$$ MTTF = \frac{1}{0.0001 + 0.0005} = 1666.7 \text{ h} $$
Similarly, we substitute the specified data into Equation (10.34) to get
$$ P_2(12) = \frac{0.0001}{0.0001 + 0.0005}\left[1 - e^{-(0.0001 + 0.0005)(12)}\right] = 0.0012 $$
Thus, the system failure probability due to human error for a 12-h mission is
0.0012.
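The model of this section reduces to the two expressions used in the example above; the Python sketch below (illustrative names) reproduces the results.

```python
import math

def system_failure_probs(t, human_rate, nonhuman_rate):
    """Equations (10.33) and (10.34): probabilities of system failure due to
    non-human (hardware) failure and due to human error by time t."""
    total = human_rate + nonhuman_rate
    common = 1.0 - math.exp(-total * t)
    return (nonhuman_rate / total) * common, (human_rate / total) * common

def system_mttf(human_rate, nonhuman_rate):
    """Equation (10.36): system mean time to failure."""
    return 1.0 / (human_rate + nonhuman_rate)

print(round(system_mttf(0.0001, 0.0005), 1))     # 1666.7 h
_, p_human = system_failure_probs(12, 0.0001, 0.0005)
print(round(p_human, 4))                          # 0.0012
```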
10.8.2 RELIABILITY ANALYSIS OF A HUMAN PERFORMING A TIME-CONTINUOUS TASK UNDER FLUCTUATING ENVIRONMENT
This mathematical model represents a human performing a time continuous task
under fluctuating environment. The environment fluctuates from normal to abnormal
(stress) and vice versa [32]. Human error can occur under both environments. However, it is logical to assume that the human error rate under the abnormal environment
will be greater than under the normal environment because of increased stress. In
this model, it is also assumed that the rate of change from normal to abnormal
environment and vice versa is constant along with human error rate under either
environment. The model state space diagram is shown in Figure 10.8. The numerals
in the boxes of this figure denote system state. The following symbols were used to
develop equations for the model:
P_i(t) is the probability of the human being in state i at time t: i = 0 (performing the task correctly in the normal environment), i = 1 (committed an error in the normal environment), i = 2 (performing the task correctly in the abnormal environment), i = 3 (committed an error in the abnormal environment).
λ_h is the constant human error rate in the normal environment.
λ_ah is the constant human error rate in the abnormal environment.
θ_n is the constant transition rate from the normal to the abnormal environment.
θ_a is the constant transition rate from the abnormal to the normal environment.
FIGURE 10.8 State space diagram of a human performing a time continuous task under
fluctuating environment.
Using the Markov method and Figure 10.8, we get the following system of equations:
$$ \frac{dP_0(t)}{dt} + (\lambda_h + \theta_n)\,P_0(t) = \theta_a\,P_2(t) \qquad (10.37) $$
$$ \frac{dP_1(t)}{dt} - \lambda_h\,P_0(t) = 0 \qquad (10.38) $$
$$ \frac{dP_2(t)}{dt} + (\lambda_{ah} + \theta_a)\,P_2(t) = \theta_n\,P_0(t) \qquad (10.39) $$
$$ \frac{dP_3(t)}{dt} - \lambda_{ah}\,P_2(t) = 0 \qquad (10.40) $$
Solving Equations (10.37) through (10.40) with the initial conditions P_0(0) = 1 and
P_1(0) = P_2(0) = P_3(0) = 0, we get
$$ P_0(t) = a_3\left[(s_2 + \lambda_{ah} + \theta_a)\,e^{s_2 t} - (s_1 + \lambda_{ah} + \theta_a)\,e^{s_1 t}\right] \qquad (10.41) $$
where
$$ s_1 = \left[-a_1 + \left(a_1^2 - 4a_2\right)^{1/2}\right]/2 \qquad (10.42) $$
$$ s_2 = \left[-a_1 - \left(a_1^2 - 4a_2\right)^{1/2}\right]/2 \qquad (10.43) $$
$$ a_1 \equiv \lambda_h + \lambda_{ah} + \theta_n + \theta_a \qquad (10.44) $$
$$ a_2 \equiv \lambda_h\left(\lambda_{ah} + \theta_a\right) + \theta_n\,\lambda_{ah} \qquad (10.45) $$
$$ P_1(t) = a_4 + a_5\,e^{s_2 t} - a_6\,e^{s_1 t} \qquad (10.46) $$
where
$$ a_3 = 1/(s_2 - s_1) \qquad (10.47) $$
$$ a_4 = \lambda_h\left(\lambda_{ah} + \theta_a\right)/(s_1 s_2) \qquad (10.48) $$
$$ a_5 = a_3\left(\lambda_h + a_4 s_1\right) \qquad (10.49) $$
$$ a_6 = a_3\left(\lambda_h + a_4 s_2\right) \qquad (10.50) $$
$$ P_2(t) = \theta_n\,a_3\left(e^{s_2 t} - e^{s_1 t}\right) \qquad (10.51) $$
$$ P_3(t) = a_7\left[1 + a_3\left(s_1\,e^{s_2 t} - s_2\,e^{s_1 t}\right)\right] \qquad (10.52) $$
where
$$ a_7 \equiv \lambda_{ah}\,\theta_n/(s_1 s_2) \qquad (10.53) $$
The human reliability is given by
$$ R_h(t) = P_0(t) + P_2(t) \qquad (10.54) $$
The mean time to human error is
$$ MTTHE = \int_0^{\infty} R_h(t)\,dt = \int_0^{\infty}\left[P_0(t) + P_2(t)\right] dt = \frac{\lambda_{ah} + \theta_n + \theta_a}{a_2} \qquad (10.55) $$
Example 10.8
Assume that a person is performing a time continuous task under normal and
abnormal environments with error rates 0.002 errors/h and 0.003 errors/h, respectively. The transition rates from normal to abnormal environment and vice versa are
0.02 per hour and 0.04 per hour, respectively. Calculate the MTTHE.
Substituting the given data into Equation (10.55), we get
$$ MTTHE = \frac{0.003 + 0.02 + 0.04}{(0.002)(0.003 + 0.04) + (0.02)(0.003)} = 431.51 \text{ h} $$
Thus, the mean time to human error is 431.51 h.
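Equation (10.55) can be checked directly; the Python sketch below uses the symbols defined above (function and argument names are illustrative) and reproduces the Example 10.8 result.

```python
def fluctuating_env_mtthe(lam_h, lam_ah, theta_n, theta_a):
    """Equation (10.55): mean time to human error under a fluctuating environment.
    lam_h, lam_ah: error rates in the normal and abnormal environments;
    theta_n, theta_a: normal-to-abnormal and abnormal-to-normal transition rates."""
    a2 = lam_h * (lam_ah + theta_a) + theta_n * lam_ah
    return (lam_ah + theta_n + theta_a) / a2

print(round(fluctuating_env_mtthe(0.002, 0.003, 0.02, 0.04), 2))  # 431.51 h
```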
Human reliability related data can be collected by a number of means, including the following:
Experimental studies
Expert judgments
Self-made error reports
Human data recorder
Automatic data recorder
Published literature
The data collected from experimental studies is usually generated under the
laboratory conditions. These conditions may not be the true representatives of actual
conditions. In addition, the method is time consuming and rather expensive. The
main advantage of data collected through experimental means is that the data are
probably the least influenced by the subjective elements that may introduce some
error. One example of data based on the experimental findings is the Data Store [4].
Expert judgments are another approach to obtain human reliability data. This
approach is used quite often by human reliability methodologists and has two
attractive features: (i) it is comparatively easy to develop because a large amount of
data can be collected from a small number of expert respondents, and (ii) it is
relatively cheaper to develop. Some of the drawbacks of this method are that the
resulting data are less reliable than data obtained through other means and that
less experienced experts than required are frequently used.
In the case of self-made error reports, the person who makes the error also
reports that error. One of the drawbacks of this method is that people are generally
reluctant to confess making an error. The human data recorder approach calls for
the physical presence of a person to observe task performance and document events
as necessary. Some of the disadvantages of this approach are that it is expensive and
the observer may fail to recognize committed errors. In operational system testing
of human-machine systems, the human data recorder approach is used quite frequently. In the case of the automatic data recorder method, the use of instrumentation
permits the automatic recording of all operator actions. Two typical examples are the
Operational Performance Recording and Evaluation Data System (OPREDS) [35]
and the Performance Measurement System. The latter system was developed by the
General Physics Corporation for recording actions in nuclear power plant simulators.
The published literature approach calls for collecting the data from publications such
as journals, conference proceedings, and books.
HUMAN ERROR DATA BANKS AND SOURCES
Over the years many data banks for obtaining human error related information have
been developed [9, 37, 38]. Nine such data banks listed in Table 10.1 are reviewed
in Reference 39.
TABLE 10.1
Human Error Data Banks (the nine data banks reviewed in Reference 39)
Other important sources for obtaining human error related data are References 45
through 48. Over 20 sources for obtaining human reliability data are listed in
References 9 and 34.
BRIEF DESCRIPTIONS OF SELECTED DATA BANKS
This section briefly describes three important data banks for obtaining human error
related information [39].
Data Store. It has served as an important source for obtaining human
reliability data [49]. The Data Store was established in 1962 by the American Institute for Research, Pittsburgh [4] and it contains estimates for
time and human performance reliability. More specifically, this data bank
possesses data relating human error probabilities to design features. All
in all, Data Store is probably the most developed data bank for application
during the design process.
Aviation Safety Reporting System. Originally this data bank was developed by the National Aeronautics and Space Administration (NASA) and
contains information on civil aircraft accidents/incidents. However, the
data is based on voluntary reporting and it receives roughly 400 human
error reports per month.
Operation Performance Recording and Evaluation Data System. This
system was developed to collect data on operational human performance
by the U.S. Navy Electronics Laboratory, San Diego. The system permits
automatic monitoring of human operator actions and in the late 1960s and
early 1970s it was used to collect data on various U.S. Navy ships. Nonetheless, the system was restricted to switch turning and button-pushing.
HUMAN ERROR DATA FOR SELECTIVE TASKS
In order to give examples of the types of available human reliability data, Table 10.2
presents human error data for selective tasks taken from published sources [34].
TABLE 10.2
Human Error Data for Selective Tasks (error/task descriptions with associated values expressed as performance reliability, errors per plant-month in boiling water reactors, or errors per million operations)

10.10 PROBLEMS
10.11 REFERENCES
21. Meister, D., The problem of human-initiated failures, Proc. Eighth Natl. Symp. Reliability and Quality Control, 234-239, 1962.
22. Cooper, J.I., Human-initiated failures and man-function reporting, IRE Trans. Human
Factors, 10, 104-109, 1961.
23. McCornack, R.L., Inspector Accuracy: A Study of the Literature, Report No. SCTM
53-61 (14), 1961, Sandia Corporation, Albuquerque, NM.
24. Lee, K.W., Tillman, F.A., and Higgins, J.J., A literature survey of the human reliability
component in a man-machine system, IEEE Trans. Reliability, 37, 24-34, 1988.
25. Beech, H.R., Burns, L.E., and Sheffield, B.F., A Behavioural Approach to the Management of Stress, John Wiley & Sons, Chichester, 1982.
26. Hagen, E.W., Ed., Human reliability analysis, Nuclear Safety, 17, 315-326, 1976.
27. Regulinski, T.L. and Askren, W.B., Mathematical modeling of human performance
reliability, Proc. Annu. Symp. Reliability, 5-11, 1969.
28. Askren, W.B. and Regulinski, T.L., Quantifying human performance for reliability
analysis of systems, Human Factors, 11, 393-396, 1969.
29. Regulinski, T.L. and Askren, W.B., Stochastic modelling of human performance
effectiveness functions, Proc. Annu. Reliability Maintainability Symp., 407-416,
1972.
30. Swain, A.D., A Method for Performing a Human-Factors Reliability Analysis, Report
No. SCR-685, Sandia Corporation, Albuquerque, NM, August 1963.
31. Dhillon, B.S. and Rayapati, S.N., Reliability evaluation of multi-state device networks
with probability trees, Proc. Sixth Symp. Reliability in Electron., Hungarian Academy
of Sciences, Budapest, August 1985, pp. 27-37.
32. Dhillon, B.S., Stochastic models for predicting human reliability, Microelectronics
and Reliability, 21, 491-496, 1982.
33. Meister, D., Human Reliability, in Human Factors Review, Muckler, F.A., Ed., Human
Factors Society, Santa Monica, CA, 1984, pp. 13-53.
34. Dhillon, B.S., Human error data banks, Microelectronics and Reliability, 30, 963-971,
1990.
35. Dhillon, B.S. and Singh, C., Engineering Reliability: New Techniques and Applications, John Wiley & Sons, New York 1981.
36. Urmston, R., Operational Performance Recording and Evaluation Data System
(OPREDS), Descriptive Brochures, Code 3400, Navy Electronics Laboratory Center,
San Diego, CA, November 1971.
37. Kohoutek, H.J., Human centered design, in Handbook of Reliability Engineering and
Management, Grant Ireson, W., Coombs, C.F., and Moss, R.Y., Eds., McGraw-Hill,
New York, 1996, pp. 9.19.30.
38. LaSala, K.P., Human reliability: An overview, tutorial notes, Annu. Reliability Maintainability Symp., 145, 1992.
39. Topmiller, D.A., Eckel, J.S., and Kozinsky, E.J., Human Reliability Data Bank for
Nuclear Power Plant Operations: A Review of Existing Human Reliability Data
Banks, Report No. NUREG/CR2744/1, U.S. Nuclear Regulatory Commission, Washington, D.C., 1982.
40. Reporting Procedures Manual for the Nuclear Plant Reliability Data System
(NPRDS), South-West Research Institute, San Antonio, TX, December 1980.
41. Irwin, I.A., Levitz, J.J., and Freed, A.M., Human Reliability in the Performance of
Maintenance, Report No. LRP 317/TDR-63-218, Aerojet-General Corporation, Sacramento, CA, 1964.
42. Aviation Safety Reporting Program, FAA Advisory Circular No. 00-46B, Federal
Aviation Administration (FAA), Washington, D.C., June 15, 1979.
43. Hornyak, S.J., Effectiveness of Display Subsystems Measurement Prediction Techniques, Report No. TR-67-292, Rome Air Development Center (RADC), Griffiss Air
Force Base, Rome, NY, September 1967.
44. Life Sciences Accident and Incident Classification Elements and Factors, AFISC
Operating Instruction No. AFISCM, 127-6, U.S. Air Force, Washington D.C., December 1971.
45. Stewart, C., The Probability of Human Error in Selected Nuclear Maintenance Tasks,
Report No. EGG-SSDC-5580, Idaho National Engineering Laboratory, Idaho Falls,
ID, 1981.
46. Boff, K.R. and Lincoln, J.E., Engineering Data Compendium: Human Perception and Performance, Vols. 1-3, Armstrong Aerospace Medical Research Laboratory, Wright-Patterson Air Force Base, Ohio, 1988.
47. Gertman, D.I. and Blackman, H.S., Human Reliability and Safety Analysis Data
Handbook, John Wiley & Sons, New York, 1994.
48. Swain, A.D. and Guttmann, H.E., Handbook of Human Reliability Analysis with
Emphasis on Nuclear Power Plant Applications, Report No. NUREG/CR-1278, The
United States Nuclear Regulatory Commission, Washington, D.C., 1983.
49. Meister, D., Human reliability data base and future systems, Proc. Annu. Reliability
Maintainability Symp., 276-280, 1993.
11
Reliability Testing
and Growth
11.1 INTRODUCTION
Just like any other reliability activity, reliability testing is an important reliability task. In fact, it may be called one of the most important activities of a reliability program. The main purpose of reliability testing is to obtain information regarding failures, in particular, the product's/equipment's tendency to fail as well as the failure consequences. This type of information is extremely useful in
well as the failure consequences. This type of information is extremely useful in
controlling failure tendencies along with their consequences. A good reliability test
program may be classified as the one providing the maximum amount of information
concerning failures from a minimal amount of testing [1]. Over the years, many
important publications on reliability testing have appeared; in particular, two such
publications are listed in References 2 and 3.
In the design and development of new complex and sophisticated systems, the
first prototypes usually contain various design and engineering related deficiencies.
In fact, according to References 4 and 5, the reliability of revolutionary design products/systems could be very low, i.e., 15 to 50% of their mature design capability. This means that without the initiation of various corrective measures during the development stage to improve reliability, such products'/systems' reliability could very well remain at the low initial value. Nonetheless, correcting weaknesses or errors in design and manufacturing methods, eliminating bad components, etc. leads to a product's reliability growth [6]. The term reliability growth may be defined as the positive improvement in a reliability parameter over a period of time because of changes made to the product design or the manufacturing process [7]. Similarly, the term reliability growth program may be described as a structured process for discovering reliability related deficiencies through testing, analyzing such deficiencies, and implementing corrective measures to lower their occurrence rate.
The serious thinking concerning reliability growth may be traced back to the
late 1950s. In 1964, a popular reliability growth monitoring model was postulated
by Duane [8]. Comprehensive lists of publications up to 1980 on reliability growth
are given in References 9 and 10.
This chapter discusses reliability testing and growth.
For success testing (i.e., a test in which no failures are allowed), the lower one-sided confidence limit on reliability is given by

R_Low = α^{1/k}    (11.1)

where
k is the number of items/units placed on test.
α is the level of significance or consumer's risk.

Thus, with 100(1 − α)% confidence, it may be stated that

R_Low ≤ R_T    (11.2)

where
R_T is the true reliability.

Taking natural logarithms of both sides of Equation (11.1) leads to

ln R_Low = (1/k) ln α    (11.3)

Thus,

k = ln α / ln R_Low    (11.4)

The level of significance is related to the confidence level C by

α = 1 − C    (11.5)

Substituting Equation (11.5) into Equation (11.4) gives

k = ln(1 − C)/ln R_Low    (11.6)

Thus, for a reliability R_T to be demonstrated at confidence level C,

k = ln(1 − C)/ln R_T    (11.7)
Equation (11.7) can be used to determine the number of items to be tested for
specified reliability and confidence level.
Example 11.1
Assume that 95% reliability of a television set is to be demonstrated at 85% confidence level. Determine the number of television sets to be placed on test when no
failures are allowed.
Substituting the given data into Equation (11.7) yields

k = ln(1 − 0.85)/ln(0.95) = 36.98 ≅ 37
Thus, 37 television sets must be placed on test.
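A quick numerical check of Equation (11.7) can be written in a few lines of Python; the function name and arguments below are illustrative only.

```python
import math

def success_test_sample_size(reliability, confidence):
    """Number of units to test with zero failures, per Equation (11.7)."""
    k = math.log(1.0 - confidence) / math.log(reliability)
    return math.ceil(k)

print(success_test_sample_size(0.95, 0.85))  # -> 37, as in Example 11.1
```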
CONFIDENCE INTERVAL ESTIMATES FOR MEAN TIME BETWEEN FAILURES
In many practically inclined reliability studies, the time to item failure is assumed
to be exponentially distributed. Thus, the item failure rate becomes constant and, in turn, the mean time between failures (MTBF) is simply the reciprocal of the failure rate (i.e., 1/λ, where λ is the item constant failure rate).
In testing a sample of items with exponentially distributed times to failures, a
point estimate of MTBF can be made but, unfortunately, this figure provides an
incomplete picture because it fails to give any surety of measurement. However, it
would probably be more realistic if we say, for example, that after testing a sample
for t hours, m number of failures have occurred and the actual MTBF lies somewhere
between specific upper and lower limits with certain confidence.
The confidence intervals on MTBF can be computed by using the χ² (chi-square) distribution. The general notation used to obtain chi-square values is as follows:

χ²(p, df)    (11.8)

where
p is a quantity that is a function of the confidence coefficient.
df is the degrees of freedom.
The following symbols are used in subsequent associated formulas [11, 13]:
C = 1 − α is the confidence level (α is the consumer's risk).
k is the number of items placed on test at time zero.
m is the number of failures.
m* is the preassigned number of failures at which testing is terminated.
Thus, for the above two cases, to compute upper and lower limits, the formulas that
can be used are as follows [11, 13]:
Preassigned truncation time, t*:

[2X/χ²(α/2, 2m + 2), 2X/χ²(1 − α/2, 2m)]    (11.9)

Preassigned number of failures, m*:

[2X/χ²(α/2, 2m), 2X/χ²(1 − α/2, 2m)]    (11.10)
The value of X is determined by the test types: replacement test (i.e., the failed unit
is replaced or repaired), non-replacement test.
Thus, for the replacement test, we have
X = k t*    (11.11)

and, for the non-replacement test,

X = (k − m) t* + Σ_{i=1}^{m} t_i    (11.12)

where
t_i is the ith failure time.
In the case of censored items (i.e., withdrawal or loss of unfailed items) the value
of X becomes as follows:
For replaced failed units but non-replacement of censored items,

X = (k − s) t* + Σ_{j=1}^{s} t_j    (11.13)

where
s is the number of censored items.
t_j is the jth censorship time.

For non-replacement of both failed and censored items,

X = (k − s − m) t* + Σ_{j=1}^{s} t_j + Σ_{i=1}^{m} t_i    (11.14)
MTBF = 1380/4 = 345 h

Inserting the given and other values into relationship (11.9) and using Table 11.1 yields the following upper and lower limit values for the MTBF:

Upper limit = 2(1380)/χ²(0.95, 8) = 2(1380)/2.73 = 1010.9 h

Lower limit = 2(1380)/χ²(0.05, 10) = 2(1380)/18.30 = 150.8 h
TABLE 11.1
Values of Chi-Square

Degrees of                          Probability
freedom     0.99     0.95     0.9      0.5      0.1      0.05     0.01
2           0.02     0.1      0.21     1.38     4.6      5.99     9.21
4           0.29     0.71     1.06     3.35     7.77     9.44     13.27
6           0.87     1.63     2.2      5.34     10.64    12.59    16.81
8           1.64     2.73     3.49     7.34     13.36    15.5     20.09
10          2.55     3.94     4.86     9.34     15.98    18.3     23.2
12          3.57     5.22     6.3      11.34    18.54    21.02    26.21
14          4.66     6.57     7.79     13.33    21.06    23.68    29.14
16          5.81     7.96     9.31     15.33    23.54    26.29    32
18          7.01     9.39     10.86    17.33    25.98    28.86    34.8
20          8.26     10.85    12.44    19.33    28.41    31.41    37.56
22          9.54     12.33    14.04    21.33    30.81    33.92    40.28
24          10.85    13.84    15.65    23.33    33.19    36.41    42.98
26          12.19    15.37    17.29    25.33    35.56    38.88    45.64
28          13.56    16.92    18.93    27.33    37.91    41.33    48.27
30          14.95    18.49    20.59    29.33    40.25    43.77    50.89
Thus, we can state with 90% confidence that the medical device's true MTBF will lie between 150.8 h and 1010.9 h, i.e., 150.8 ≤ MTBF ≤ 1010.9.
Example 11.3
Assume that 20 identical electronic parts were put on test at zero time and at the
occurrence of the tenth failure, the testing was stopped. The tenth failure occurred
at 150 h and all the failed parts were replaced.
Determine the MTBF of the electronic parts and upper and lower limits on
MTBF at 80% confidence level.
Inserting the given data into Equation (11.11) we get
X = (20) (150) = 3000 h
Thus, the electronic parts' MTBF is

MTBF = 3000/10 = 300 h
Substituting the specified and other values into relationship (11.10) and using Table 11.1, we get the following values of MTBF upper and lower limits:

Upper limit = 2(3000)/χ²(0.90, 20) = 6000/12.44 = 482.3 h

Lower limit = 2(3000)/χ²(0.10, 20) = 6000/28.41 = 211.1 h
At 80% confidence level, the electronic parts' true MTBF will lie between 211.1 h and 482.3 h, i.e., 211.1 ≤ MTBF ≤ 482.3.
Example 11.4
On the basis of data given in Example 11.3, determine the following:
Probability of an electronic part's success for a 200-h mission.
Upper and lower limits on this probability at 80% confidence level.
Thus, the electronic part's reliability is

R(200) = e^{−200/300} = 0.5134

The upper and lower limit reliabilities are as follows:

Upper limit = e^{−200/482.3} = 0.6606

Lower limit = e^{−200/211.1} = 0.3877
Thus, the reliability of the electronic part is 0.5134 and its upper and lower
values at 80% confidence level are 0.6606 and 0.3877, respectively.
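Where chi-square tables are not at hand, the limits of relationship (11.10) can be evaluated with SciPy, as in the hedged sketch below; note that Table 11.1 is indexed by upper-tail probability, so chi2.ppf(1 − p, df) is used here to match the book's χ²(p, df) notation. The function name and arguments are illustrative.

```python
from scipy.stats import chi2

def mtbf_limits_failure_truncated(X, m, confidence):
    """Two-sided MTBF limits for a failure-truncated test, per relationship (11.10).

    X is the accumulated test time and m the number of failures.
    """
    alpha = 1.0 - confidence
    lower = 2.0 * X / chi2.ppf(1.0 - alpha / 2.0, 2 * m)  # book's chi-square(alpha/2, 2m)
    upper = 2.0 * X / chi2.ppf(alpha / 2.0, 2 * m)        # book's chi-square(1 - alpha/2, 2m)
    return lower, upper

print(mtbf_limits_failure_truncated(3000, 10, 0.80))  # approx. (211.2, 482.2), as in Example 11.3
```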
The failure probability density function at the normal operating condition is related to that at the accelerated (stressful) condition by

f_n(t) = (1/AF) f_s(t/AF)    (11.15)

where
t is time.
AF is the acceleration factor.
f_n(t) is the normal operating condition failure probability density function.
f_s(·) is the stressful operating condition failure probability density function.

Similarly, the normal operating condition cumulative distribution function is

F_n(t) = F_s(t/AF)    (11.16)

where
F_s(·) is the cumulative distribution function at the stressful operating condition.

Time to failure
The time to failure at the normal operating condition is given by

t_n = (AF) t_s    (11.17)

where
t_n is the time to failure at the normal operating condition.
t_s is the time to failure at the stressful operating condition.

Hazard rate
The normal operating condition hazard rate is given by

h_n(t) = f_n(t)/[1 − F_n(t)]    (11.18)

Substituting Equations (11.15) and (11.16) into Equation (11.18) yields

h_n(t) = (1/AF) f_s(t/AF)/[1 − F_s(t/AF)]    (11.19)

i.e.,

h_n(t) = (1/AF) h_s(t/AF)    (11.20)

where
h_s(·) is the hazard rate at the stressful operating condition.
Acceleration Model
For an exponentially distributed time to failure at an accelerated stress, s, the
cumulative distribution function is
F_s(t) = 1 − e^{−λ_s t}    (11.21)

where
λ_s is the parameter or the constant failure rate at the stressful level.

Thus, from Equations (11.16) and (11.21), we get

F_n(t) = F_s(t/AF) = 1 − e^{−(λ_s/AF) t}    (11.22)

Therefore,

λ_n = λ_s/AF    (11.23)

where
λ_n is the constant failure rate at the normal operating condition.
For both non-censored and censored data, the failure rate at the stressful level can
be estimated from the following two equations, respectively [14]:
Noncensored data
λ̂_s = k / Σ_{j=1}^{k} t_j    (11.24)

where
k is the total number of items under test at a certain stress.
t_j is the jth failure time; for j = 1, 2, 3, …, k.

Censored data

λ̂_s = q / [Σ_{j=1}^{q} t_j + Σ_{j=1}^{k−q} t′_j]    (11.25)

where
q is the number of failed items at the accelerated stress.
t′_j is the jth censoring time.
Example 11.5
Assume that a sample of 40 integrated circuits were accelerated life tested at 135C
and their times to failure were exponentially distributed with a mean value of 7500 h.
If the value of the acceleration factor is 30 and the integrated circuits normal
operating temperature is 25C, calculate the integrated circuits, operating at the
normal conditions, failure rate, mean time to failure, and reliability for a 5000-h
mission.
In this case, the failure rate of the integrated circuits at the accelerated temperature is given by

λ_s = 1/(integrated circuits' mean life under accelerated testing)
    = 1/7500
    ≅ 0.000133 failure/h

Inserting the above result and the specified data into Equation (11.23), we get

λ_n = 0.000133/30 ≅ 4.4333 × 10⁻⁶ failure/h

Thus, the integrated circuits' mean time to failure (MTTF_n) at the normal operating condition is

MTTF_n = 1/λ_n ≅ 225,563.9 h

The integrated circuits' reliability for a 5000-h mission at the normal operating condition is

R(5000) = e^{−(4.4333 × 10⁻⁶)(5000)} ≅ 0.9781

Thus, the integrated circuits' failure rate, mean time to failure, and reliability at the normal operating condition are 4.4333 × 10⁻⁶ failure/h, 225,563.9 h, and 0.9781, respectively.
Duane [8] observed that the plot of cumulative operating hours vs. cumulative failure rate fell close to a straight line on log-log paper when a continued reliability improvement effort was maintained. Thus, he defined the cumulative failure rate of his model as follows:
λ_k = f/T = θ T^{−α}    (11.26)
where
λ_k is the cumulative failure rate.
α is a parameter denoting the growth rate.
θ is a parameter determined by circumstances such as product complexity, design margin, and design objective.
T is the total test hours.
f denotes the number of failures during T.
In order to estimate the values of the parameters, we take the logarithms of both sides of Equation (11.26) to get

log λ_k = log θ − α log T    (11.27)

Equation (11.27) is the equation of a straight line. Thus, the plot of the logarithm of the cumulative failure rate, λ_k, against the logarithm of cumulative operating hours, T, can be used to estimate the values of θ and α. The slope of the fitted line gives the growth rate α and, at T = 1, the cumulative failure rate equals θ.
The least-squares method can be used to obtain a more accurate straight-line fit in estimating θ and α [10].
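As an illustration of such a least-squares fit, the sketch below estimates θ and α from the logarithms of hypothetical cumulative-test-time and cumulative-failure data; the data values are invented purely for demonstration.

```python
import numpy as np

# Hypothetical cumulative test times (h) and cumulative failure counts.
T = np.array([100.0, 300.0, 700.0, 1500.0, 3000.0])
failures = np.array([8, 16, 25, 36, 50])

lam_cum = failures / T                                  # cumulative failure rate, Equation (11.26)
slope, intercept = np.polyfit(np.log10(T), np.log10(lam_cum), 1)

alpha = -slope          # growth rate (Equation (11.27) has slope -alpha)
theta = 10.0 ** intercept   # cumulative failure rate at T = 1
print(alpha, theta)
```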
(11.28)
(11.29)
where
λ(t) is the intensity function.
The instantaneous mean time between failures is expressed by

m(t) = 1/λ(t)    (11.30)

where
m(t) is the instantaneous mean time between failures.
The model is described in detail in References 2 and 7.
11.5 PROBLEMS
1. Discuss the following:
Demonstration testing
Qualification and acceptance testing
Success testing
2. What is the difference between reliability demonstration testing and reliability growth modeling?
3. Assume that 99% reliability of a medical device is to be demonstrated at
60% confidence level. Calculate the total number of medical devices to
be placed on test when no failures are allowed.
4. What are the two approaches used to perform an accelerated life test?
Describe them in detail.
5. What are the benefits of reliability growth testing?
6. Twenty-five identical electrical devices were put on test at zero time, none
of the failed units were replaced, and the test was stopped after 200 h.
Seven devices malfunctioned after 14, 25, 30, 65, 120, 140, and 150 h of
operation. Calculate the electrical devices' mean time between failures
and their associated upper and lower limits with 80% confidence level.
11.6 REFERENCES
1. AMC Pamphlet AMCP 702-3, Quality Assurance, U.S. Army Material Command,
Washington, D.C., 1968.
2. MIL-HDBK-781, Reliability Test Methods, Plans and Environments for Engineering
Development, Qualification and Production, Department of Defense, Washington,
D.C.
3. MIL-STD-781, Reliability Design Qualification and Production Acceptance Test:
Exponential Distribution, U.S. Department of Defense, Washington, D.C.
4. Benton, A.W. and Crow, L.H., Integrated reliability growth testing, Proc. Annu.
Reliability Maintainability Symp., 160-166, 1989.
5. Crow, L.H., Reliability growth management, models, and standards, in tutorial notes,
Annu. Reliability Maintainability Symp., 1-12, 1994.
6. Mead, P.H., Reliability growth of electronic equipment, Microelectronics and Reliability, 14, 439-443, 1975.
7. MIL-HDBK-189, Reliability Growth Management, Department of Defense, Washington, D.C.
8. Duane, J.T., Learning curve approach to reliability monitoring, IEEE Trans. Aerospace, 563-566, 1964.
9. Dhillon, B.S., Reliability growth: A survey, Microelectronics and Reliability, 20,
743-751, 1980.
10. Dhillon, B.S., Reliability Engineering in Systems Design and Operation, Van Nostrand Reinhold Company, New York, 1983.
11. Von Alven, W.H., Ed., Reliability Engineering, Prentice-Hall, Englewood Cliffs, NJ,
1964.
12. Grant Ireson, W., Coombs, C.F., and Moss, R.Y., Handbook of Reliability Engineering
and Management, McGraw-Hill, New York, 1996.
13. Dhillon, B.S., Systems Reliability, Maintainability and Management, Petrocelli
Books, New York, 1983.
14. Elsayed, E.A., Reliability Engineering, Addison Wesley Longman, Reading, MA,
1996.
15. Nelson, W., Accelerated Testing, John Wiley & Sons, New York, 1980.
16. Bain, L.J. and Engelhardt, M., Statistical Analysis of Reliability and Life-Testing
Models: Theory, Marcel Dekker, New York, 1991.
17. Meeker, W.Q. and Hahn, G.J., How to Plan an Accelerated Life Test: Some Practical
Guidelines, American Society for Quality Control (ASQC), Milwaukee, WI, 1985.
18. Tobias, P.A. and Trindade, D., Applied Reliability, Van Nostrand Reinhold Company,
New York, 1986.
19. Crow, L.H., Estimation procedures for the Duane model, Proc. U.S. Army Mater.
Syst. Anal. Act. (AMSAA) Reliability Growth Symp., Aberdeen Proving Ground,
Maryland, September 1972.
20. Crow, L.H., Reliability analysis for complex repairable systems, in Reliability and
Biometry, Proschan, F. and Serfling, R.J., Eds., Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA, 1974, pp. 379-410.
12
Reliability in
Computer Systems
12.1 INTRODUCTION
In recent years, the applications of computers have increased quite dramatically
ranging from personal use to the control of space systems. Today's computers are much more complex and powerful than their infant versions. As computers are made up of both software and hardware components, the percentage of the total computer cost spent on software has changed quite remarkably from the days of the first generation computers. For example, in 1955 software (including software maintenance) accounted for 20% of the total computer cost, and by 1985 the software share had increased to 90% [1].
For an effective performance of a computer, both its hardware and software must
function with considerable reliability. In fact, according to various past studies, the
software is many times more likely to fail than hardware. Nonetheless, as computers
are becoming more complex and sophisticated, the demand on their reliability has
increased exponentially. For example, NASA's Saturn V launch computer (circa 1964) had a mean time to failure (MTTF) goal of 25,000 h and the SIFT and FTMP
avionic computers designed and developed in the late 1970s to control dynamically
unstable aircraft were expected to have a MTTF of 1 billion hours [2, 3]. The early
history of computer hardware reliability may be traced back to the works of
Shannon [4], Hamming [5], Von Neumann [6], and Moore and Shannon [7]. For
example, the triple modular redundancy (TMR) scheme was first proposed by Von
Neumann [6] to improve system reliability in 1956. Since then, many other people
have contributed to the computer hardware reliability and a comprehensive list of
publications on the subject is given in Reference 8.
The serious effort on software reliability appears to have started at Bell Laboratories in 1964 [9]. Evidence of this effort is a histogram of problems per month concerning switching system software. In 1967, Floyd [10] considered approaches for formal validation of software programs and Hudson [11] proposed Markov birth-death models. Since then, a large number of publications on the subject
have appeared [8].
This chapter presents important aspects of computer hardware and software
reliability.
TABLE 12.1
Comparison of Hardware and Software Reliability (hardware reliability characteristics)

1. Wears out.
2. Many hardware parts fail according to the bathtub hazard rate curve.
3. A hardware failure is mainly due to physical effects.
4. The failed system is repaired by performing corrective maintenance.
5. Interfaces are visual.
6. The hardware reliability field is well established, particularly in the area of electronics.
7. Obtaining good failure data is a problem.
8. Usually redundancy is effective.
9. Potential for monetary savings.
10. Preventive maintenance is performed to inhibit failures.
11. It is possible to repair hardware by using spare modules.
12. Hardware reliability has well-established mathematical concepts and theory.
13. Has a hazard function.
14. Mean time to repair has significance.
to a dropped bit. Nowadays, memory parity errors are very rare because of the impressive improvement in hardware reliability, and they are not necessarily fatal. Mysterious failures occur unexpectedly, and thus in real-time systems such failures are never properly categorized. For example, when a normally functioning system stops operating suddenly without indicating any problem (i.e., software, hardware, etc.), the malfunction is called a mysterious failure.
Errors/faults introduced in the test phase significantly influence the test program's capability to uncover faults in the software; such errors/faults include test plans/procedures that wrongly interpret software requirements and errors/faults in code written for the test program.
In the acceptance testing phase, the test team determines if the product under
consideration meets its original requirements. The acceptance test plan is developed
and the phase terminates when all the tests in the acceptance plan are successfully
executed.
In the maintenance and operation phase, attention is focused primarily on rectifying errors that appear during the exercise of the software or on fine-tuning the software to improve its performance.
In the retirement phase, the user may decide not to use the software any more
and discard it altogether. The primary reason is that the software may have become
difficult to maintain because of recurrent changes.
The consensus recovery block method combines attributes of the preceding two
approaches; thus, it attempts to discard the weaknesses of those two methods.
Nonetheless, this approach requires the development of N versions of the software as well as an acceptance test and a voting procedure. The reliability factor is used to rank the
different versions of the software after execution of all software versions. The
resulting outputs are submitted to the voting mechanism. If no agreement is reached,
the order of reliability is used in submitting each output successively to the acceptance test. As soon as one of the resulting outputs passes the test, the process
terminates and the software under consideration continues with its operation. All in
all, it may be said that this method is more reliable than recovery-block design and
N-version programming approaches.
12.6.4 TESTING
This may simply be described as the process of executing a software program to
uncover errors. There are many different types of software testing including module
testing, top-down testing, bottom-up testing, and sandwich testing [8, 29–31]. Each
of these testing methods is discussed below.
Module Testing. This is concerned with testing each module usually in
isolation from the rest of the system, but subject to the environments to
be experienced in the system. Usually, a general module test program is
developed instead of writing an entirely new program for each module to be tested. Because uncovered errors are easy to correct at this stage and errors that escape to later stages can have graver consequences, it is advisable to conduct module testing thoroughly [32].
Top-Down Testing. This is concerned with integrating and testing the
software program from the top end to the bottom end and in the program
structure, the top module is the only module that is unit tested in isolation.
At the conclusion of the top module testing, this module calls the remaining modules one by one to merge with it. In turn, each combination is
tested and the process continues until the completion of combining and
testing of all the modules [31]. The advantages of the top-down testing
are efficiency in locating problem areas at the higher program levels, easy
representation of test cases after adding the input/output functions, and
early demonstrations because of the early skeletal program. Similarly,
some of the drawbacks of the top-down testing are that stub modules must
be written, stub modules may be complex, test conditions could be impossible or difficult to create, and the representation of test cases in stubs can
be difficult prior to adding the input/output functions [12].
Bottom-Up Testing. This is concerned with integrating and testing the
software program from the bottom end to the top end and the terminal
modules are module (unit) tested in isolation. These modules do not call
any other modules and at the conclusion of the terminal module testing,
the modules that directly call these tested modules are the ones in line
for testing. The modules are not tested in isolation; in fact, the testing is
conducted together with the earlier tested lower level modules. The process continues until reaching the top of the program. Some of the benefits
of the bottom-up testing are ease in creating test conditions, efficiency in
locating problem areas at the lower levels of the program, and ease in
observing test results. Similarly, the bottom-up testing drawbacks
include: driver modules must be produced, and the program does not exist
as an entity until the adding of the final module [12].
Sandwich Testing. This method is the result of combining the top-down
testing and bottom-up testing methods. Obviously, the idea behind developing this approach was to extract benefits of both top-down testing and
bottom-up testing and eliminate some of their drawbacks. As both top-down testing and bottom-up testing are performed simultaneously, the
program is integrated from both sides (i.e., top and bottom). Even though
the resulting meeting point from both sides depends upon the program
being tested, it can be predetermined by reviewing the structure of the
program.
assess reliability of software. FMEA and FTA are described in detail in Chapters 6 and
7, respectively.
cd = (Σ_{i=1}^{n} N_i)/L    (12.1)

where
cd is the cumulative defect ratio for design.
n is the total number of reviews.
N_i is the total number of unique defects, at or above a given severity level, discovered in the ith design review.
L is the number of source lines of design statement, expressed in thousands, in the design phase.
If the estimated defect density is greater than that of comparable projects, review the development process to determine whether poor training/practices are responsible or whether the requirements are ambiguous or incomplete. Under such circumstances, it may be the correct course to delay development until corrective measures can be taken. In contrast, if the estimated defect density is less than that of comparable projects, review the methodology and the review process itself. If the assessed review process is considered to be satisfactory, it is quite reasonable to conclude that the development phases are generating low-defect software products.
Code and Unit Test Phase Measure
For this phase, another form of the defect density measure is more appropriate and
again this form requires the establishment of defect severity classifications. Thus,
the cumulative defect ratio for code is expressed by
cd = (Σ_{i=1}^{n} M_i)/SL    (12.2)

where
cd is the cumulative defect ratio for code.
n is the total number of reviews.
M_i is the total number of unique defects, at or above a given severity level, discovered in the ith code review.
SL is the number of source lines of code reviewed, expressed in thousands.
TABLE 12.2
Classification of Software Reliability Models

I. Fault seeding: This incorporates those models that determine the number of faults in the program at zero time via seeding of extraneous faults.
II. Failure count: This includes models counting the number of failures/faults occurring in given time intervals.
III. Times between failures: This incorporates models providing the time between failure estimations.
IV. Input domain based: This incorporates models that determine the program/software reliability under the circumstance that the test cases are sampled randomly from a known operational distribution of inputs to the program/software.
Examples of the models belonging to classification III (times between failures) are the Jelinski and Moranda model [40] and the Schick and Wolverton model [21].
The classification IV models, i.e., input domain based models, have three key
assumptions: (1) inputs selected randomly, (2) input domain can be partitioned into
equivalence groups, and (3) known input profile distribution. Two examples of this
classification model are the Nelson model [42] and the Ramamoorthy and Bastani
model [43].
Some of the software reliability models are presented below.
Mills Model
This is a different and more pragmatic approach to software reliability prediction
proposed by Mills [37] in 1972. Mills argues that an assessment of the faults remaining in a given software program can be made through a seeding process that makes
an assumption of a homogeneous distribution of a representative category of faults.
Prior to starting the seeding process, a fault analysis is required to determine the
expected types of faults in the code and their relative occurrence frequency. Nonetheless, an identification of both seeded and unseeded faults is made during reviews
or testing and the discovery of seeded and indigenous faults allows an assessment
of remaining faults for the fault type under consideration. It is to be noted that the
value of this measure can only be estimated, provided the seeded faults are discovered.
The maximum likelihood of the unseeded faults is expressed by [12]
N_1 = N_sf m_fu / m_sf    (12.3)

where
N_1 is the maximum likelihood estimate of the total number of unseeded (indigenous) faults.
N_sf is the number of seeded faults.
m_fu is the number of unseeded faults uncovered.
m_sf is the number of seeded faults uncovered.

The number of unseeded faults remaining in the program is then estimated as

N = N_1 − m_fu    (12.4)
Example 12.1
A software program under consideration was seeded with 35 faults and, during
testing, 60 faults of the same type were found. The breakdowns of the faults uncovered were 25 seeded faults and 35 unseeded faults. Estimate the number of unseeded
faults remaining in the program.
Substituting the given data into Equation (12.3) yields
N_1 = (35)(35)/25 = 49 faults
Inserting the above result and the other specified data value into Equation (12.4),
we get
N = 49 − 35 = 14 faults
It means 14 unseeded faults still remain in the program.
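The Mills seeding arithmetic of Equations (12.3) and (12.4) is easily scripted; the following Python sketch, with illustrative function and argument names, reproduces Example 12.1.

```python
def remaining_unseeded_faults(seeded, seeded_found, unseeded_found):
    """Mills seeding estimate, per Equations (12.3) and (12.4)."""
    n1 = seeded * unseeded_found / seeded_found   # estimated total unseeded faults
    return n1 - unseeded_found                    # unseeded faults still remaining

print(remaining_unseeded_faults(35, 25, 35))  # -> 14.0, as in Example 12.1
```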
Musa Model
This model belongs to classification II models of Table 12.2 and is based on the
premise that reliability assessments in the time domain can only be based upon
actual execution time, as opposed to calendar or elapsed time. The reason for this is that only during execution does the software program become exposed to failure-provoking stress. Nonetheless, this model should only be used after the completion
of integration or, more specifically, when all the relevant modules are in place [44].
Some of the important assumptions associated with the model are as follows:
Execution time between failures is piecewise exponentially distributed.
Failure intervals are statistically independent and follow Poisson distribution.
Failure rate is proportional to the remaining defects.
A comprehensive list of assumptions may be found in Reference 8. Musa developed the following simplified equation to obtain the net number of corrected faults:
m = M [1 − exp(−k t/(M T_s))]    (12.5)

where
m is the net number of corrected faults.
M is the initial number of faults in the program.
T_s is the mean time to failure at the start of testing.
t is the execution (test) time.
k is the testing compression factor.
(12.6)
(12.7)
From the above relationships, we obtain the number of failures that must occur to improve the mean time to failure from, say, T_1 to T_2 [44]:

Δm = M T_s (1/T_1 − 1/T_2)    (12.8)

Similarly, the additional execution (test) time required is

Δt = (M T_s/k) ln(T_2/T_1)    (12.9)
Example 12.2
A newly developed software is estimated to have approximately 450 errors. Also,
at the beginning of the testing process, the recorded mean time to failure is 4 h.
Determine the amount of test time required to reduce the remaining errors to 15, if
the value of the testing compression factor is 5. Estimate reliability over a 100-h
operational period.
Substituting the given data into Equation (12.8), we get

450 − 15 = (450)(4)(1/4 − 1/T_2)    (12.10)

which yields T_2 = 120 h. Inserting this result and the other given values into Equation (12.9), we get

Δt = [(450)(4)/5] ln(120/4)
   = 1224.4 h
Thus, for the given and calculated values from Equation (12.7), we get
R (100) = exp ( 100 120) = 0.4346
Thus, the required testing time is 1224.4 h and the reliability of the software for
the specified operational period is 0.4346.
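The Example 12.2 arithmetic can be reproduced with the short Python sketch below; the variable names are illustrative, and the final line uses the exponential reliability expression quoted in the example.

```python
import math

M, Ts, k = 450, 4.0, 5           # initial faults, MTTF at start of testing (h), compression factor
remaining_target = 15

T2 = M * Ts / remaining_target                  # target MTTF implied by Equation (12.8) with T1 = Ts
delta_t = (M * Ts / k) * math.log(T2 / Ts)      # Equation (12.9)
reliability_100h = math.exp(-100.0 / T2)        # exponential reliability over 100 h, as in Example 12.2

print(T2, round(delta_t, 1), round(reliability_100h, 4))  # 120.0, 1224.4, 0.4346
```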
Shooman Model
This is one of the earliest software reliability models and has influenced the development of several others over the time period [43, 44]. The model does not require
fault collection during debugging on a continuous basis and it can be used for
software programs of all sizes. The key assumptions associated with this model are
as follows [8]:
Total machine instructions remain constant.
Debugging does not introduce new errors.
The hazard function is proportional to residual or the remaining software
errors.
The residual errors are obtained by subtracting the number of cumulative
rectified errors from the total errors initially present.
The total number of errors in the software program remains constant at the
start of integration testing and they decrease directly as errors are corrected.
The model's hazard function is expressed by

λ(t) = C F_r(x)    (12.11)

where
λ(t) is the hazard rate at operating time t.
t is the operating (execution) time.
C is a proportionality constant.
x is the debugging time.
F_r(x) is the residual fault content at debugging time x.

The residual fault content is given by

F_r(x) = F_z/I − F_c(x)    (12.12)

where
F_z is the number of initial faults at time x = 0.
I is the total number of machine language instructions.
F_c(x) is the cumulative number of faults corrected in interval x.
Inserting Equation (12.12) into Equation (12.11) yields

λ(t) = C [F_z/I − F_c(x)]    (12.13)

The reliability is expressed by

R(t) = exp[−∫_0^t λ(y) dy]    (12.14)

Thus, for a fixed debugging time x,

R(t) = exp{−C [F_z/I − F_c(x)] t}    (12.15)

By integrating Equation (12.15) over the interval [0, ∞), the following expression for mean time to failure (MTTF) results:

MTTF = ∫_0^∞ exp{−C [F_z/I − F_c(x)] t} dt = 1/{C [F_z/I − F_c(x)]}    (12.16)
In order to estimate the constants C and F_z, we use the maximum likelihood estimation approach to get [44-46]

Ĉ = [Σ_{i=1}^{N} M_i] / {Σ_{i=1}^{N} [F̂_z/I − F_c(x_i)] W_i}    (12.17)

and

Ĉ = {Σ_{i=1}^{N} M_i/[F̂_z/I − F_c(x_i)]} / [Σ_{i=1}^{N} W_i]    (12.18)
where
N (N ≥ 2) is the number of tests following the debugging intervals (0, x_1), (0, x_2), …, (0, x_N).
W_i is the total time of successful and unsuccessful (i.e., all) runs in the ith test.
M_i is the total number of runs terminating in failure in the ith test.
Power Model
This model is also known as the Duane model because Duane [47] originally proposed it for hardware reliability in 1964. For software products, the same behavior
has been observed. The reason for calling it a power model is that the mean value
function, m(t), for the cumulative number of failures by time t is taken as a power
of t, i.e.,
m(t) = α t^β, for α > 0, β > 0    (12.19)
For β = 1, we get the homogeneous Poisson process model. The key assumption
associated with this model is that the cumulative number of failures by time t, N (t),
follows a Poisson process with value function described by Equation (12.19). In
order to implement the model, the data requirement could be either of the
following [48]:
Elapsed times between failures, i.e., y1, y2, y3, , yn, where yi = t i t i1
and t 0 = 0.
Actual times the software program failed, i.e., t1, t2, t3, , tn.
If T is the time for which the software program was under observation, then we can
write
m(T)/T = α T^β/T = (expected number of faults by time T)/(total testing time T)    (12.20)

i.e.,

m(T)/T = α T^{β−1}, so that log[m(T)/T] = log α + (β − 1) log T    (12.21)
The above equation plots as a straight line and it is the form fitted to given data.
Differentiating Equation (12.19) with respect to t we get the following expression
for the failure intensity function:
dm(t)/dt = α β t^{β−1} = λ(t)    (12.22)
For β > 1, Equation (12.22) is strictly increasing; thus, there can be no growth in reliability [48].
Using the maximum likelihood estimation method, we get [49]

α̂ = n / t_n^β̂    (12.23)

and

β̂ = n / Σ_{i=1}^{n−1} ln(t_n/t_i)    (12.24)

The maximum likelihood estimate of the MTTF [i.e., for the (n + 1)st failure] is given by [48, 49]

MTTF̂ = t_n/(n β̂)    (12.25)
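A hedged Python sketch of the maximum likelihood estimates in Equations (12.23) through (12.25) follows; the failure times shown are invented purely for illustration, and the function name is an arbitrary choice.

```python
import math

def power_model_mle(failure_times):
    """MLE of the power model parameters from Equations (12.23)-(12.25).

    failure_times are the cumulative times of the n observed failures.
    """
    t = sorted(failure_times)
    n, tn = len(t), t[-1]
    beta = n / sum(math.log(tn / ti) for ti in t[:-1])   # Equation (12.24)
    alpha = n / tn ** beta                               # Equation (12.23)
    mttf_next = tn / (n * beta)                          # Equation (12.25)
    return alpha, beta, mttf_next

# Hypothetical failure times (h), for demonstration only.
print(power_model_mle([25.0, 60.0, 140.0, 300.0, 500.0]))
```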
(12.26)
where DA, NA, T, PS, C, M, LT, ER, QR, SR, and AM are the factors of the prediction model.
(12.27)
where
LSC is the number of lines of source code.
LEF is the program's linear execution frequency.
FER is the fault exposure ratio (1.4 × 10⁻⁷ ≤ FER ≤ 10.6 × 10⁻⁷).
The linear execution frequency, LEF, of the program is expressed by

LEF = MIR/OIP    (12.28)

where
MIR is the mean instruction rate.
OIP is the number of object instructions in the program.
The number of object instructions, OIP, is given by

OIP = (SI)(CER)    (12.29)

where
SI is the number of source instructions.
CER is the code expansion ratio. More specifically, this is the ratio of machine instructions to source instructions and, normally, its average value is taken as 4.

The number of inherent faults, IF, is expressed by

IF = (δ)(LSC)    (12.30)

where δ is the fault density.
(12.31)
R_TMRV = (3R_m² − 2R_m³) R_V    (12.32)
where
RTMRV is the reliability of the TMR system with voter.
RV
is the voter reliability.
Rm
is the module/unit reliability.
With perfect voter, i.e., RV = 1, Equation (12.32) reduces to
R_TMR = 3R_m² − 2R_m³    (12.33)
where
RTMR is the reliability of the TMR system with perfect voter.
The improvement in reliability of the TMR system over a single unit system is
determined by the single units reliability and the reliability of the voter. In the case
of perfect voter, the reliability of the TMR system given by Equation (12.33) is only
better than the single unit system when the reliability of the single unit or module
is greater than 0.5. At RV = 0.9, the reliability of the TMR system is only marginally
better than the single unit system reliability when the reliability of the single unit
is approximately between 0.667 and 0.833 [12]. Furthermore, when RV = 0.8, the
TMR system reliability is always less than a single unit's reliability.
For the single unit system's reliability, R_S = R_m, and the perfect voter, i.e., R_V = 1, the ratio of R_TMR to R_S is expressed by [3]

r = R_TMR/R_S = (3R_m² − 2R_m³)/R_m = 3R_m − 2R_m²    (12.34)
For example, for R_m = 0.75, Equations (12.33) and (12.34) yield

R_TMR = 3(0.75)² − 2(0.75)³ ≅ 0.84    (12.35)

and

r = 3(0.75) − 2(0.75)² = 1.125
Example 12.3
The reliability of the TMR system with perfect voter is given by Equation (12.33).
Determine the point where both the single unit or simplex system reliability is equal
to the TMR system reliability. Assume that the simplex system reliability is given by
R_S = R_m    (12.36)
Equating Equations (12.33) and (12.36) yields

R_m = 3R_m² − 2R_m³    (12.37)

Dividing both sides by R_m and rearranging, we get

2R_m² − 3R_m + 1 = 0    (12.38)
Therefore,

R_m = {3 + [9 − 4(2)(1)]^{1/2}}/[2(2)] = 1

and

R_m = {3 − [9 − 4(2)(1)]^{1/2}}/[2(2)] = 1/2
Thus, at R_m = 1 and R_m = 1/2, the reliability of the simplex and TMR systems is the same. This means that the TMR system reliability will only be better than the reliability of the simplex system when R_m > 0.5.
Example 12.4
Assume that the reliability of the single unit in Example 12.3 is time dependent and
is expressed by
R_m(t) = e^{−λ_m t}    (12.39)
where
t is time.
λ_m is the unit/module constant failure rate.
Determine the point where both the single unit or simplex system reliability is equal
to the TMR system reliability.
Since from Example 12.3 we have R_m = 1 and R_m = 1/2, we can write

e^{−λ_m t} = 1    (12.40)

and

e^{−λ_m t} = 1/2    (12.41)

that is, λ_m t = 0 and λ_m t = 0.6931, respectively.
Equations (12.40) and (12.41) indicate that at t = 0 or λ_m = 0 and at λ_m t = 0.6931, respectively, the reliability of the simplex and TMR systems is the same. The reliability of the TMR system will only be better than the simplex system reliability when the value of λ_m t is less than 0.6931.
TMR System Time Dependent Reliability and Mean Time
to Failure (MTTF)
For constant failure rates of units, the TMR system with voter reliability, using Equation (12.32), is

R_TMRV(t) = (3e^{−2λ_m t} − 2e^{−3λ_m t}) e^{−λ_V t}
          = 3e^{−(2λ_m + λ_V)t} − 2e^{−(3λ_m + λ_V)t}    (12.42)
where
t is time.
R_TMRV(t) is the TMR system with voter reliability at time t.
λ_m is the module/unit constant failure rate.
λ_V is the voter constant failure rate.
Integrating Equation (12.42) over the interval from 0 to ∞, we get the following expression for the MTTF of the TMR system with voter:

MTTF_TMRV = ∫_0^∞ [3e^{−(2λ_m + λ_V)t} − 2e^{−(3λ_m + λ_V)t}] dt
          = 3/(2λ_m + λ_V) − 2/(3λ_m + λ_V)    (12.43)
For the perfect voter, i.e., λ_V = 0, Equations (12.42) and (12.43) simplify to

R_TMR(t) = 3e^{−2λ_m t} − 2e^{−3λ_m t}    (12.44)

and

MTTF_TMR = 3/(2λ_m) − 2/(3λ_m) = 5/(6λ_m)    (12.45)
where
RTMR (t) is the TMR system with perfect voter reliability at time t.
MTTFTMR is the MTTF of the TMR system with perfect voter.
Example 12.5
Assume that the failure rate of a unit/module belonging to an independent TMR system with voter is λ_m = 0.0001 failures/h. Calculate the TMR system MTTF, if the voter failure rate is λ_V = 0.00002 failures/h. Also, compute the system reliability for a 100-h mission.
Inserting the specified data into Equation (12.43) yields

MTTF_TMRV = 3/[2(0.0001) + 0.00002] − 2/[3(0.0001) + 0.00002] ≅ 7386 h

Similarly, using Equation (12.42), the system reliability for the 100-h mission is

R_TMRV(100) = 3e^{−[2(0.0001) + 0.00002](100)} − 2e^{−[3(0.0001) + 0.00002](100)} ≅ 0.9977
Thus, the reliability and MTTF of the TMR system with voter are 0.9977 and
7386 h, respectively.
Example 12.6
Repeat the Example 12.5 calculations for a TMR system with perfect voter. Comment
on the end results.
Thus, in this case we have λ_V = 0.
Substituting the remaining given data into Equations (12.44) and (12.45), we get

R_TMR(100) = 3e^{−(2)(0.0001)(100)} − 2e^{−(3)(0.0001)(100)} ≅ 0.9997

and

MTTF_TMR = 5/[6(0.0001)] ≅ 8333 h
In this case, the reliability and MTTF of the TMR system with perfect voter are
0.9997 and 8333 h, respectively. The perfect voter helped to improve the TMR
system reliability and MTTF, as can be observed by comparing Examples 12.5 and
12.6 results.
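The TMR-with-voter expressions in Equations (12.42) and (12.43) are straightforward to evaluate programmatically; the Python sketch below, with illustrative names, reproduces the results of Examples 12.5 and 12.6.

```python
import math

def tmr_with_voter(lam_m, lam_v, t):
    """Mission reliability and MTTF of a TMR system with voter, Equations (12.42) and (12.43)."""
    rel = 3 * math.exp(-(2 * lam_m + lam_v) * t) - 2 * math.exp(-(3 * lam_m + lam_v) * t)
    mttf = 3 / (2 * lam_m + lam_v) - 2 / (3 * lam_m + lam_v)
    return rel, mttf

print(tmr_with_voter(0.0001, 0.00002, 100))   # Example 12.5: approx. (0.9977, 7386 h)
print(tmr_with_voter(0.0001, 0.0, 100))       # Example 12.6 (perfect voter): approx. (0.9997, 8333 h)
```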
TMR System with Repair and Perfect Voter Reliability Analysis
In the preceding TMR system reliability analysis, all units were considered nonrepairable. This is quite true in certain applications like space exploration but in
others it is possible to repair the failed units. Thus, in this case we consider an
independent and identical unit repairable TMR system with perfect voter. As soon
as any one of the TMR system units fails, it is immediately repaired. When more
than one unit fails, the TMR system is not repaired. The TMR system state space
diagram is shown in Figure 12.3.
Markov technique is used to develop state probability equations [8] for the state
space diagram shown in Figure 12.3. The following assumptions are associated with
this TMR system model:
j is the jth state of the TMR system shown in Figure 12.3; j = 0 (three units up), j = 1 (one unit failed, system up), j = 2 (two units failed, system failed).
P_j(t) is the probability that the TMR system is in state j at time t; for j = 0, 1, and 2.
λ_m is the unit/module failure rate.
μ is the unit/module repair rate.
P_0′(t) = −3λ_m P_0(t) + μ P_1(t)    (12.46)

P_1′(t) = 3λ_m P_0(t) − (2λ_m + μ) P_1(t)    (12.47)

P_2′(t) = 2λ_m P_1(t)    (12.48)
where the prime denotes differentiation with respect to time t. At time t = 0, P0 (0) = 1
and P1 (0) = P2 (0) = 0.
By solving Equations (12.46) through (12.48), we get the following expression for the TMR system reliability, with repair and perfect voter:

R_TMRr(t) = P_0(t) + P_1(t)
          = [1/(x_1 − x_2)] [(5λ_m + μ)(e^{x_1 t} − e^{x_2 t}) + x_1 e^{x_1 t} − x_2 e^{x_2 t}]    (12.49)

where

x_1, x_2 = [−(5λ_m + μ) ± (λ_m² + μ² + 10λ_m μ)^{1/2}]/2    (12.50)
The TMR system mean time to failure, with repair and perfect voter, is

MTTF_TMRr = ∫_0^∞ R_TMRr(t) dt = 5/(6λ_m) + μ/(6λ_m²)    (12.51)
where
MTTFTMRr is the TMR system, with repair and perfect voter, mean time to
failure.
For no repair facility, i.e., μ = 0, Equation (12.51) becomes the same as Equation (12.45).
Example 12.7
A unit/module of an independent TMR system, with repair and perfect voter, has a
failure rate of 0.005 failures/h. The unit repair rate is 0.6 repairs/h. Calculate the
following:
System with repair mean time to failure.
System mean time to failure, if there is no repair.
Comment on the end results.
Substituting the given data into Equation (12.51) yields
MTTFTMRr =
5
0.6
+
2
6 (0.005) 6 (0.006)
4167 h
Similarly, for no repair, inserting the given data into Equation (12.45) yields
MTTFTMR =
5
6 (0.005)
167
It means the introduction of repair helped to increase the TMR system MTTF
from 167 h to 4167 h.
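Equation (12.51) can likewise be checked numerically; the following minimal Python function (name illustrative) reproduces the Example 12.7 results.

```python
def tmr_mttf_with_repair(lam_m, mu):
    """MTTF of a TMR system with repair and perfect voter, Equation (12.51)."""
    return 5 / (6 * lam_m) + mu / (6 * lam_m ** 2)

print(round(tmr_mttf_with_repair(0.005, 0.6)))   # -> 4167 h (with repair)
print(round(tmr_mttf_with_repair(0.005, 0.0)))   # -> 167 h (no repair)
```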
The reliability of an N-modular redundancy (NMR) system with voter, containing N = 2m + 1 independent and identical units (i.e., the system can tolerate up to m unit failures), is given by [8]

R_NMRV = R_V Σ_{i=0}^{m} C(N, i) R_m^{N−i} (1 − R_m)^i    (12.52)

where

C(N, i) = N!/[(N − i)! i!]    (12.53)
where
RNMRV is the reliability of the NMR system with voter.
RV
is the voter reliability.
Rm
is the module reliability.
The time dependent and other reliability analysis of this system can be performed
in a manner similar to the TMR system analysis. Additional redundancy schemes
may be found in Reference 8.
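As a numerical illustration of Equations (12.52) and (12.53), the Python sketch below evaluates the NMR reliability by direct binomial summation; setting m = 1 and a perfect voter recovers the TMR expression of Equation (12.33). The function and argument names are illustrative.

```python
from math import comb

def nmr_reliability(r_m, r_v, m):
    """Reliability of an NMR (N = 2m + 1) system with voter, Equations (12.52)-(12.53)."""
    N = 2 * m + 1
    return r_v * sum(comb(N, i) * r_m ** (N - i) * (1 - r_m) ** i for i in range(m + 1))

print(nmr_reliability(0.9, 1.0, 1))   # TMR with perfect voter: 3(0.9)^2 - 2(0.9)^3 = 0.972
```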
12.9 PROBLEMS
1. Make a comparison of hardware reliability and software reliability.
2. What are the major sources of computer failures? Describe them in detail.
3. Define the following terms:
Software error
Debugging
Software reliability
Fault-tolerant computing
12.10 REFERENCES
1. Keene, S.J., Software reliability concepts, Annu. Reliability and Maintainability
Symp. Tutorial Notes, 1-21, 1992.
2. Pradhan, D.K., Ed., Fault-Tolerant Computing Theory and Techniques, Vols. 1 and
2, Prentice-Hall, Englewood Cliffs, NJ, 1986.
3. Shooman, M.L., Fault-tolerant computing, Annu. Reliability and Maintainability
Symp. Tutorial Notes, 1-25, 1994.
4. Shannon, C.E., A mathematical theory of communications, Bell System Tech. J., 27,
379-423 and 623-656, 1948.
5. Hamming, R.W., Error detecting and error correcting codes, Bell System Tech. J., 29,
147-160, 1950.
6. Von Neumann, J., Probabilistic logics and the synthesis of reliable organisms from
reliable components, in Automata Studies, Shannon, C.E. and McCarthy, J., Eds.,
Princeton University Press, Princeton, NJ, 1956, pp. 43-98.
7. Moore, E.F. and Shannon, C.E., Reliable circuits using less reliable relays, J. Franklin
Inst., 262, 191-208, 1956.
8. Dhillon, B.S., Reliability in Computer System Design, Ablex Publishing, Norwood,
NJ, 1987.
9. Haugk, G., Tsiang, S.H., and Zimmerman, L., System testing of the no. 1 electronic
switching system, Bell System Tech. J., 43, 2575-2592, 1964.
10. Floyd, R.W., Assigning meanings to programs, Math. Aspects Comp. Sci., XIX, 19-32,
1967.
11. Hudson, G.R., Programming Errors as a Birth-and-Death Process, Report No. SP3011, System Development Corporation, 1967.
12. Pecht, M., Ed., Product Reliability, Maintainability, and Supportability Handbook,
CRC Press, Boca Raton, FL, 1995.
13. Lipow, M., Prediction of software failures, J. Syst. Software, 1, 71-75, 1979.
14. Anderson, R.T., Reliability Design Handbook, Rome Air Development Center, Griffiss Air Force Base, Rome, NY, 1976.
15. Glass, R.L., Software Reliability Guidebook, Prentice-Hall, Englewood Cliffs, NJ,
1979.
16. Avizienis, A., Fault-tolerant computing: An overview, Computer, January/February,
5-8, 1971.
17. Kline, M.B., Software and hardware reliability and maintainability: What are the
differences?, Proc. Annu. Reliability and Maintainability Symp., 179-185, 1980.
18. Grant Ireson, W., Coombs, C.F., and Moss, R.Y., Handbook of Reliability Engineering
and Management, McGraw-Hill, New York, 1996.
19. Dhillon, B.S., Reliability Engineering in Systems Design and Operation, Van Nostrand Reinhold Company, New York, 1983.
20. Yourdon, E., The causes of system failures Part II, Modern Data, 5, 50-56, 1972.
21. Yourdon, E., The causes of system failures Part III, Modern Data, 5, 36-40, 1972.
22. Bell, D., Morrey, L., and Pugh, J., Software Engineering: A Programming Approach,
Prentice-Hall, London, 1992.
23. Wang, R.S., Program with measurable structure, Proc. Am. Soc. Quality Control
Conf., 389-396, 1980.
24. Koestler, A., The Ghost in the Machine, Macmillan, New York, 1967.
25. Neufelder, A.M., Ensuring Software Reliability, Marcel Dekker, New York, 1993.
26. Scott, R.K., Gault, J.W., and McAllister, D.G., Fault tolerant software reliability
modeling, IEEE Trans. Software Eng., 13(5), 1987.
27. Neuhold, E.J. and Paul, M., Formal description of programming concepts, Int. Fed.
Info. Process. (IFIP) Conf. Proc., 310-315, 1991.
28. Galton, A., Logic as a formal method, Comp. J., 35(5), 213-218, 1992.
29. Beizer, B., Software System Testing and Quality Assurance, Van Nostrand Reinhold
Company, New York, 1984.
30. Myers, G.J., The Art of Software Testing, John Wiley & Sons, New York, 1979.
31. Myers, G.J., Software Reliability: Principles and Practices, John Wiley & Sons, New
York, 1976.
32. Kopetz, H., Software Reliability, Macmillan, London, 1979.
33. Musa, J.D., Iannino, A., and Okumoto, K., Software Reliability, McGraw-Hill, New
York, 1987.
34. Sukert, A.N., An investigation of software reliability models, Proc. Annu. Reliability
Maintainability Symp., 478-484, 1977.
35. Schick, G.J. and Wolverton, R.W., An analysis of competing software reliability
models, IEEE Trans. Software Eng., 4, 104-120, 1978.
36. Hudson, G.R., Program Errors as a Birth and Death Process, Report No. SP-3011,
System Development Corporation, December 4, 1967.
37. Mills, H.D., On the Statistical Validation of Computer Programs, Report No. 72-6015, IBM Federal Systems Division, Gaithersburg, MD, 1972.
38. Musa, J.D., A theory of software reliability and its applications, IEEE Trans. Software
Eng., 1, 312-327, 1975.
39. Shooman, M.L., Software reliability measurement and models, Proc. Annu. Reliability Maintainability Symp., 485-491, 1975.
40. Jelinski, Z. and Moranda, P.B., Software reliability research, in Proceedings of the
Statistical Methods for the Evaluation of Computer System Performance, Academic
Press, 1972, pp. 465-484.
41. Schick, G.J. and Wolverton, R.W., Assessment of software reliability, Proceedings of
the Operations Research Physica-Verlag, Wurzburg-Wien, 1973, pp. 395-422.
42. Nelson, E., Estimating software reliability from test data, Microelectronics and Reliability, 17, 67-75, 1978.
43. Ramamoorthy, C.V. and Bastani, F.B., Software reliability: status and perspectives,
IEEE Trans. Software Eng., 8, 354-371, 1982.
44. Dunn, R. and Ullman, R., Quality Assurance for Computer Software, McGraw-Hill,
New York, 1982.
45. Craig, G.R., Software Reliability Study, Report No. RADC-TR-74-250, Rome Air
Development Center, Griffiss Air Force Base, Rome, NY, 1974.
46. Thayer, T.A., Lipow, M., and Nelson, E.C., Software Reliability, North-Holland
Publishing Company, New York, 1978.
47. Duane, J.T., Learning curve approach to reliability monitoring, IEEE Trans. Aerospace, 2, 563-566, 1964.
48. Lyu, M.R., Ed., Handbook of Software Reliability Engineering, McGraw-Hill, New
York, 1996.
49. Crow, L.H., in Reliability Analysis for Complex Repairable Systems, Reliability and
Biometry, Proschan, F. and Serfling, R.J., Eds., Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA, 1974, pp. 379-410.
50. Methodology for Software Reliability Prediction and Assessment, Report No. RLTR-92-52, Volumes I and II, Rome Air Development Center, Griffiss Air Force Base,
Rome, NY, 1992.
51. Mathur, F.P. and Avizienis, A., Reliability analysis and architecture of a hybrid
redundant digital system: Generalized triple modular redundancy with self-repair,
1970 Spring Joint Computer Conference, AFIPS Conf. Proc., 36, 375-387, 1970.
13
Robot Reliability
and Safety
13.1 INTRODUCTION
Robots are increasingly being used in the industry to perform various types of tasks:
material handling, arc welding, spot welding, etc.
The history of robots/automation may be traced back to the ancient times when
the Egyptians built water-powered clocks and the Chinese and Greeks built water- and steam-powered toys. Nonetheless, the idea of the functional robot originated in Greece in the writings of Aristotle (4th century B.C.), the teacher of Alexander the Great, in which he wrote: "If every instrument could accomplish its own work, obeying or anticipating the will of others" [1].
In 1920, Karel Capek (1890–1938), a Czechoslovak science-fiction writer, first coined the word robot and used it in his play entitled Rossum's Universal Robots, which opened in London in 1921. In the Czechoslovakian language, robot means worker.
worker. In 1939, Isaac Asimov wrote a series of stories about robots and in 1942,
he developed the following three laws for robots [2]:
A robot must not harm a human, nor, through inaction, permit a human
to come to harm.
Robots must obey humans unless their orders conflict with the preceding
law.
A robot must safeguard its own existence unless it conflicts with the
preceding two laws.
In 1954, George Devol [3] developed a programmable device that could be
considered the first industrial robot. Nonetheless, in 1959, the first commercial robot
was manufactured by the Planet Corporation and Japan, today the world leader in
the use of robots, imported its first robot in 1967 [3]. In 1970, the first conference
on industrial robots was held in Chicago, Illinois, and five years later (i.e., in 1975),
the Robot Institute of America (RIA) was founded. Today, there are many journals
and a large number of publications on robots.
In 1983, the estimated world robot population was around 30,000 [4] and by the end of 1998, it was forecast to be around 820,000 [5].
A robot may be defined as a reprogrammable multi-functional manipulator
designed to move material, parts, tools, or specialized items through variable programmed motions for the performance of various tasks [6]. As a robot has to be safe
and reliable, this chapter discusses both of these topics.
TABLE 13.1
National Institute for Occupational Safety and Health Recommendations
for Minimizing Robot Related Incidents

No.  Recommendation description
1    Allow sufficient clearance distance around all moving parts of the robot system.
2    Provide adequate illumination in control and operational areas of the robot system.
3    Provide a physical barrier incorporating gates with electrical interlocks, so that when any one gate opens, robot operation is terminated.
4    Ensure that floors or working surfaces are marked so that the robot work area is clearly visible.
5    Provide back-up to devices such as light curtains, motion sensors, electrical interlocks, or floor sensors.
6    Provide remote diagnostic instrumentation so that maximum system troubleshooting can be conducted from outside the robot work area.
Subsequent investigation revealed that in all of the above cases, the humans put themselves in harm's way; the robot alone did not kill the human.
Some of the causal factors for robot accidents include robot design, workplace
design interfacing, and workplace design guarding. In order to minimize the risk of
robot incidents, the National Institute for Occupational Safety and Health proposed
various recommendations [15]. In fact, all of these recommendations were directed
at three strategic areas: (1) worker training, (2) worker supervision, and (3) robot
system design. The robot system design related recommendations are presented in Table 13.1.
ROBOT HAZARDS AND SAFETY PROBLEMS
There are various hazards associated with robots. The three basic types of hazards
are shown in Figure 13.1 [16, 17]. These are impact hazards, trapping point hazards, and those that develop from the application itself. The impact hazards are concerned with an individual(s) being struck by a robot's moving parts, by items being carried by the robot, or by flying objects ejected or dropped by the robot. The trapping point hazards usually occur because of robot movements with respect to fixed objects such as machines and posts in the same space. Furthermore, other possible factors include the movements of auxiliary equipment such as work carriages and pallets. The hazards that are generated by the application itself include
burns, electric shocks, exposure to toxic substances, and arc flash.
There could be many different causes of the above three basic types of robot
hazards, but the most prevalent causes include unauthorized access, human error, and mechanical-related problems arising from the robot system itself or from the application.
SAFETY CONSIDERATIONS IN THE ROBOT LIFE CYCLE
A robot life cycle could be divided into four distinct phases: (1) design, (2) installation, (3) programming, and (4) operation and maintenance. The overall robot safety
problems can only be minimized if careful consideration is given to safety throughout
the robot life cycle. Some safety measures that could be taken during each of the
four robot life cycle phases are discussed below [2, 7, 18-21]:
Design Phase
The safety measures associated with this phase may be categorized into three distinct
groups: electrical, software, and mechanical. The electrical group safety measures
include eliminating the risk of an electric shock, minimizing the effects of electromagnetic and radio-frequency interferences, designing wiring circuitry capable of
stopping the robot's movement and locking its brakes, and having built-in hose and
cable routes using adequate insulation, sectionalization, and panel covers. Some of
the safety measures belonging to the software group include having built-in safety
commands, periodically examining the built-in self-checking software for safety,
having a stand-by power source for robots functioning with programs in random
access memory, using a procedure of checks for determining why a failure occurred,
and prohibiting a restart by merely resetting a single switch.
The mechanical group related safety measures include providing dynamic brakes
for a software crash or power failure, providing drive mechanism covers, providing
several emergency stop buttons, providing mechanisms for releasing stored energy, and putting guards on items such as gears, belts, and pulleys.
Installation Phase
There are a large number of safety measures that can be taken during the robot
installation phase. Some of these measures include installing interlocks, sensing
devices, etc.; distancing circuit boards from electromagnetic fields; installing electrical
cables according to electrical codes; identifying the danger areas with the aid of codes;
placing robot controls outside the hazard zone; providing an appropriate level of
illumination to humans concerned with the robot; and labeling stored energy sources.
Programming Phase
During this phase, programmers/setters are the people particularly at risk. For example, according to a study of 131 cases [21], such people were at the highest risk
57% of the time. Some of the safety measures that can be taken during the programming phase include designing the programmer work area so that unnecessary stress with respect to stand accessibility, visibility/lighting, and forced postures is eliminated; installing hold-to-run buttons; turning off safety-related devices with a key only; planning the programmer's location outside the robot movement area; and marking the programming position.
Operation and Maintenance Phase
There are a large number of safety related measures that belong to this phase. Some
of these are developing appropriate safety operations and maintenance procedures,
ensuring functionality of all emergency stops, providing appropriate protective gear
to all concerned individuals, minimizing the potential energy of an unexpected
motion by having the robot arm extended to its maximum reach position, blocking
out all concerned power sources during maintenance, posting the robot's operating weight capacity, observing all government codes and other regulations concerned
with robot operation and maintenance, and ensuring that only authorized and trained
personnel operate and maintain robots.
Intelligent Systems
This approach makes use of intelligent control systems that make decisions through
remote sensing, hardware, and software. In order to have an effective intelligent
collision-avoidance system, the robot operating environment has to be restricted.
Also, special sensors and software should be used. Nonetheless, in most industrial settings it is usually not possible to restrict the environment.
Electronic Devices
This approach makes use of ultrasonics for perimeter control to seek protection from intrusions. The perimeter control electronic barriers use active sensors for intrusion detection. Usually, the use of ultrasonics is considered in those circumstances where unobstructed floor space is a critical issue. An evaluation of robot perimeter control
devices is presented in Reference 23.
Infrared Light Arrays
This approach makes use of linear arrays of infrared sources known as light curtains.
The light curtains are generally reliable and provide excellent protection to individuals from potential dangers in the robot's operating area. False triggering is
the common problem experienced with the use of light curtains. It may occur due
to various factors including smoke, flashing lights, or heavy dust in situations where
the system components are incorrectly aligned.
Physical Barriers
Even though the use of physical barriers is an important approach to safeguard humans,
it is not the absolute solution to a safety problem. The typical examples of physical
barriers include safety rails, plastic safety chains, chain-link fences, and tagged-rope
barriers. Some of the useful guidelines associated with physical barriers include:
Consider using safety rails in areas free of projectiles.
Consider using fences in places where long-range projectiles are considered a hazard.
Safety rails and chain-link fences are quite effective in places where
intrusion is a serious problem.
When considering the use of a peripheral physical barrier, it is useful to
ask questions such as the following:
What is being protected?
Is it possible that it can be bypassed?
How effective is the barrier under consideration?
How reliable is the barrier under consideration?
HUMAN FACTORS IN ROBOTIC SAFETY
One of the single most important components in robotic safety is the issue of human
factors. Just like in the case of other automated machines, the problem of human
factors in robotics is very apparent and it has to be carefully considered. This section
presents human factor aspects of robotic safety in five distinct areas as shown in
Figure 13.2 [24]. These are human-robot interface design, documentation preparation, methods analysis, future considerations, and miscellaneous considerations.
The human-robot interface design is one important area which requires careful
consideration of human factors. Here, the primary objective is to develop the human-robot interface design so that the probability of human error occurrence is at a minimum. There are various steps that help to prevent the occurrence of human error, including analyzing human actions during robotic processes, designing hardware and software with the intention of reducing human errors, paying careful attention to layouts, and considering factors such as weight, layout of buttons, readability of the buttons' functional descriptions, connectors' flexibility, and hand-held devices' shape and size.
Documentation preparation is concerned with the quality of documentation for use by robot users. These documents must be developed by considering the qualifications and experience of the target group. Some examples of target groups are operators, maintainers, programmers, and designers. Nonetheless, during the preparation of such documents, factors such as easily understandable information, inclusion of practical exercises, completeness of information, and inclusion of pictorial descriptions at appropriate places must be carefully considered.
The classical industrial engineering approach to methods analysis is found to
be quite effective in improving robotic safety with respect to human factors. Flow
process charts and multiple-activity process charts are two important examples.
Some of the future considerations to improve robotic safety with respect to
human factors include the application of artificial intelligence to the worker-robot
interface, and considering human factors in factory design with respect to possible
robot applications.
13.3.1 CAUSES AND CLASSIFICATIONS OF ROBOT FAILURES
Over the years, there have been various studies performed to determine the causes
of robot failures. As a result of such studies, some of the frequently occurring failure
causes highlighted were oil pressure valve problems, servo valve malfunctions, noise,
printed circuit board troubles, and human errors. In particular, References 28 and
29 pointed out that the robot problems occurred in the following order: control
system problems, jig and other tool incompatibility, robot body problems, programming and operation errors, welding gun problems and problems of other tooling
parts, deterioration and precision deficiency, runaway, and miscellaneous.
A robot's reliability and its safe operation are primarily affected by four types
of failures [30, 31]: (1) random component failures, (2) systematic hardware faults,
(3) software faults, and (4) human errors.
The failures that occur unpredictably during the useful life of a robot are known
as random failures. The reasons for the occurrence of such failures include undetectable defects, unavoidable failures, low safety factors, and unexplainable causes.
The systematic hardware faults occur because of the existence of unrevealed mechanisms in the robot design. One example of the protection against robot hardware
faults is to use sensors in the system to detect the loss of pneumatic pressure/line
voltage/hydraulic pressure. Reference 32 presents various useful techniques to
reduce the occurrence of systematic hardware faults/failures.
The occurrence of software faults is an important factor in the malfunctioning of robots; such faults may occur in the embedded software, the controlling software, or the application software. Some studies have revealed that over 60% of software errors are made during the requirement and design phases as opposed to less than 40% during the coding phase. Measures such as failure mode and effects analysis (FMEA), fault tree analysis, and testing help to reduce robot software faults.
During the design, manufacture, test, operation, and maintenance of a robot,
various types of human errors may occur. Some studies reveal that human error represents a significant proportion of total equipment failures. There could be many
causes for the occurrence of human error including poor equipment design, poorly
written operating and maintenance procedures, poorly trained operation and maintenance manpower, task complexity, inadequate lighting in the work zone, and
improper tools used by maintenance personnel.
Mean Time to Robot Failure
The mean time to robot failure (MTRF) may be expressed as

MTRF = ∫₀^∞ Rb(t) dt    (13.1)

or, in terms of field data, as

MTRF = (Trp − Tdrf)/N    (13.2)

where
Rb(t) is the robot reliability at time t.
Trp is the robot production time, expressed in hours.
Tdrf is the downtime due to robot failure, expressed in hours.
N is the total number of robot failures.
Example 13.2
Assume that the failure rate of a robot is 0.0004 failure/h and its reliability is
expressed by the following equation:
Rb(t) = e^(−λb t)    (13.3)

where
Rb(t) is the robot reliability at time t.
λb is the robot failure rate.

Substituting Equation (13.3) into Equation (13.1) yields

MTRF = ∫₀^∞ e^(−λb t) dt    (13.4)
     = 1/λb

Inserting the robot failure rate value into Equation (13.4), we get

MTRF = 1/0.0004
     = 2500 h
Thus, the robot mean time to failure is 2500 h.
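For readers who want to double-check such results, the short Python sketch below (an illustration, not part of the original text) evaluates the closed-form value 1/λb of Equation (13.4) and also approximates the integral of Equation (13.1) numerically for the exponential reliability function of Example 13.2.

    import math

    lam_b = 0.0004    # robot failure rate (failures/h), as given in Example 13.2

    # Closed-form MTRF from Equation (13.4): the integral of exp(-lam_b*t) over [0, inf) is 1/lam_b
    mtrf_closed_form = 1.0 / lam_b

    # Numerical check of Equation (13.1): trapezoidal integration of Rb(t) = exp(-lam_b*t)
    dt = 1.0          # integration step (h)
    t_max = 50000.0   # horizon long enough that the neglected tail is negligible
    mtrf_numeric = sum(
        0.5 * (math.exp(-lam_b * i * dt) + math.exp(-lam_b * (i + 1) * dt)) * dt
        for i in range(int(t_max / dt))
    )

    print(mtrf_closed_form)          # 2500.0
    print(round(mtrf_numeric, 1))    # approximately 2500.0

Both values agree, as expected, since for the exponential distribution the mean time to failure is simply the reciprocal of the failure rate.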
Mean Time to Robot-Related Problems
The mean time to robot-related problems (MTRP) simply is the average productive robot time prior to the occurrence of a robot-related problem and is expressed by

MTRP = (Trp − Tdrp)/K    (13.5)

where
Tdrp is the downtime due to robot-related problems, expressed in hours.
K is the total number of robot-related problems.
Example 13.3
Assume that at an industrial robot installation the total robot production hours were
25,000 h. Furthermore, downtime due to robot-related problems was 800 h and there
have been a total of 30 robot-related problems. Calculate the MTRP.
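Equation (13.5) makes this a one-line calculation; the following Python sketch (an illustrative addition, not the book's own solution) carries it out with the values stated in the example.

    t_rp = 25000.0   # total robot production hours
    t_drp = 800.0    # downtime due to robot-related problems (h)
    k = 30           # total number of robot-related problems

    mtrp = (t_rp - t_drp) / k   # Equation (13.5)
    print(round(mtrp, 2))       # 806.67

Thus the installation averaged roughly 807 productive hours between robot-related problems.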
The robot reliability at time t can be expressed as

Rb(t) = exp[−∫₀^t λb(t) dt]    (13.6)

where
λb(t) is the robot time-dependent failure rate or hazard rate.
It is to be noted that Equation (13.6) can be used to obtain robot reliability for any
given robot time to failure distribution (e.g., Weibull, gamma, or exponential).
Example 13.4
A robot's time to failure is exponentially distributed; thus, its failure rate, λb, is constant at 0.0004 failure/h. Calculate the robot reliability for a 15-h operating mission.
Thus, in this case, we have

λb(t) = λb    (13.7)

Substituting Equation (13.7) into Equation (13.6) yields

Rb(t) = exp[−∫₀^t λb dt]    (13.8)
      = e^(−λb t)

Substituting the specified data into Equation (13.8) we get

Rb(15) = e^(−(0.0004)(15))
       = 0.994
Thus, there is a 99.4% chance that the robot will operate successfully during
the specified mission.
RELIABILITY ANALYSIS AND PREDICTION METHODS
In the field of reliability engineering, there are many techniques and methods available that can be used effectively to perform robot reliability and availability analysis
and prediction studies. Some of these methods are network reduction technique,
fault tree analysis (FTA), failure modes and effect analysis (FMEA), and Markov
method [32].
In particular, the two widely used practical methods, i.e., part stress analysis
prediction and part count reliability prediction, can also be used to predict robot
system reliability during the design phase. Both of these methods are described in
Reference 33.
The Markov method is also a widely used approach in reliability engineering
and it can equally be applied to perform various kinds of robot reliability studies.
Two of its applications are presented in the next section.
RELIABILITY AND AVAILABILITY ANALYSIS OF A ROBOT
This model represents a robot system that can fail either due to human error or due to other failures, and the failed robot system is repaired. The following symbols were used to develop the equations for the model:

i is the ith state of the robot system: i = 0 (robot system operating normally), i = 1 (robot system failed due to human error), i = 2 (robot system failed due to failures other than human error).
Pi(t) is the probability that the robot system is in state i at time t; for i = 0, 1, 2.
λh is the robot system constant human error rate.
λ is the robot system constant failure rate (for failures other than human error).
μh is the robot system constant repair rate from state 1 to state 0.
μ is the robot system constant repair rate from state 2 to state 0.

Using the Markov approach, the following system of differential equations is associated with the model:
P0′(t) + (λ + λh) P0(t) = μh P1(t) + μ P2(t)    (13.9)

P1′(t) + μh P1(t) = λh P0(t)    (13.10)

P2′(t) + μ P2(t) = λ P0(t)    (13.11)

At time t = 0, P0(0) = 1, P1(0) = 0, and P2(0) = 0. Solving Equations (13.9) through (13.11) yields

P0(t) = (μ μh)/(k1 k2) + [(k1 + μ)(k1 + μh)/(k1(k1 − k2))] e^(k1 t) + [(k2 + μ)(k2 + μh)/(k2(k2 − k1))] e^(k2 t)    (13.12)

where

k1, k2 = [−b ± (b² − 4(μ μh + λh μ + λ μh))^(1/2)]/2
b = λ + λh + μ + μh
k1 k2 = μ μh + λh μ + λ μh
k1 + k2 = −(λ + λh + μ + μh)

P1(t) = (λh μ)/(k1 k2) + [λh(k1 + μ)/(k1(k1 − k2))] e^(k1 t) + [λh(k2 + μ)/(k2(k2 − k1))] e^(k2 t)    (13.13)

P2(t) = (λ μh)/(k1 k2) + [λ(k1 + μh)/(k1(k1 − k2))] e^(k1 t) + [λ(k2 + μh)/(k2(k2 − k1))] e^(k2 t)    (13.14)

The robot system time-dependent availability is given by

AR(t) = P0(t) = (μ μh)/(k1 k2) + [(k1 + μ)(k1 + μh)/(k1(k1 − k2))] e^(k1 t) + [(k2 + μ)(k2 + μh)/(k2(k2 − k1))] e^(k2 t)    (13.15)

As time t becomes very large, the robot system steady-state availability and other state probabilities become

AR = lim(t→∞) AR(t) = (μ μh)/(k1 k2)    (13.16)

P1 = lim(t→∞) P1(t) = (λh μ)/(k1 k2)    (13.17)

P2 = lim(t→∞) P2(t) = (λ μh)/(k1 k2)    (13.18)
where
AR is the robot system steady state availability.
P1 is the robot system steady state failure probability due to human error.
P2 is the robot system steady state failure probability due to failures other
than human error.
In order to perform robot system reliability analysis, we set μ = μh = 0 in Equations (13.9) through (13.11) and then solve the resulting equations to get
P0(t) = e^(−(λ + λh) t)    (13.19)

P1(t) = [λh/(λ + λh)] [1 − e^(−(λ + λh) t)]    (13.20)

P2(t) = [λ/(λ + λh)] [1 − e^(−(λ + λh) t)]    (13.21)

The robot system reliability is thus

Rr(t) = P0(t) = e^(−(λ + λh) t)    (13.22)
Similarly, the probability of the robot system failure, Ph (t), due to human error from
Equation (13.20) is
Ph(t) = P1(t) = [λh/(λ + λh)] [1 − Rr(t)]    (13.23)
The robot system mean time to failure is given by

MTTFr = ∫₀^∞ Rr(t) dt    (13.24)
      = ∫₀^∞ e^(−(λ + λh) t) dt    (13.25)
      = 1/(λh + λ)

The robot system overall hazard rate or its total failure rate is expressed by

λr = −[1/Rr(t)] [d Rr(t)/dt]
   = −e^((λh + λ) t) [d e^(−(λ + λh) t)/dt]    (13.26)
   = λh + λ
Example 13.5
Assume that a robot system can fail either due to human error or other failures and
the human error and other failure times are exponentially distributed. Thus, the values of the human error and other failure rates are λh = 0.0005 error per hour and λ = 0.002 failure per hour, respectively. The failed robot system repair times are exponentially distributed with the repair rate value of μ = μh = 0.004 repair per hour. Calculate the robot system steady state availability and its reliability for a 100-h mission.
Thus, we have
λ = 0.002 failure/h, λh = 0.0005 error/h, μ = μh = 0.004 repair/h, and t = 100 h.
Substituting the specified values into Equation (13.16) yields the robot system steady state availability:
AR = (μ μh)/(k1 k2) = (μ μh)/(μ μh + λh μ + λ μh)
   = (0.004)²/[(0.004)² + 0.004 (0.002 + 0.0005)]
   = 0.6154
Inserting the given data into Equation (13.22) we get
Rr(100) = e^(−(0.0005 + 0.002)(100))
= 0.7788
Thus, the robot system steady state availability and reliability are 0.6154 and
0.7788, respectively.
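As a quick numerical cross-check of Example 13.5 (a sketch only, based on the steady-state availability of Equation (13.16) and the mission reliability of Equation (13.22)):

    import math

    lam = 0.002        # failure rate for failures other than human error (failures/h)
    lam_h = 0.0005     # human error rate (errors/h)
    mu = mu_h = 0.004  # repair rates (repairs/h)
    t = 100.0          # mission length (h)

    # Equation (13.16): steady-state availability
    k1k2 = mu * mu_h + lam_h * mu + lam * mu_h
    availability = (mu * mu_h) / k1k2

    # Equation (13.22): mission reliability
    reliability = math.exp(-(lam + lam_h) * t)

    print(round(availability, 4))  # 0.6154
    print(round(reliability, 4))   # 0.7788

Both printed values match the hand calculation above.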
RELIABILITY ANALYSIS OF A REPAIRABLE/NON-REPAIRABLE ROBOT SYSTEM
The following notation is associated with the model shown in Figure 13.4:
i
ith state of the system: i = 0 (robot and its associated safety system
operating normally), i = 1 (robot operating normally, robot safety
system failed), i = 2 (robot failed with an incident), i = 3 (robot failed
safely), i = 4 (robot failed but its associated safety system operated
normally).
Pi (t) The probability that the robot system is in state i at time t; for i = 0,
1, 2, 3, 4.
λi is the ith constant failure rate; i = s (state 0 to state 1), i = ri (state 1 to state 2), i = rs (state 1 to state 3), i = r (state 0 to state 4).
Using the Markov approach, the following system of differential equations is
associated with Figure 13.4:
d P0(t)/dt + (λs + λr) P0(t) = 0    (13.27)

d P1(t)/dt + (λri + λrs) P1(t) = P0(t) λs    (13.28)

d P2(t)/dt = P1(t) λri    (13.29)

d P3(t)/dt = P1(t) λrs    (13.30)

d P4(t)/dt = P0(t) λr    (13.31)
At time t = 0, P0(0) = 1 and P1(0) = P2(0) = P3(0) = P4(0) = 0. Solving Equations (13.27) through (13.31) yields

P0(t) = e^(−A t)    (13.32)

P1(t) = (λs/B) [e^(−C t) − e^(−A t)]    (13.33)

where
A = λs + λr
B = λs + λr − λrs − λri
C = λri + λrs

P2(t) = [λri λs/(A C)] {1 − [A e^(−C t) − C e^(−A t)]/B}    (13.34)

P3(t) = [λrs λs/(A C)] {1 − [A e^(−C t) − C e^(−A t)]/B}    (13.35)

P4(t) = (λr/A) [1 − e^(−A t)]    (13.36)
The reliability of both robot and its safety system working normally is given by
Rrsu(t) = P0(t) = e^(−A t)    (13.37)

The reliability of the robot working normally with or without the safety system functioning successfully is

Rss(t) = P0(t) + P1(t) = e^(−A t) + (λs/B) [e^(−C t) − e^(−A t)]    (13.38)

The mean time to failure of the robot with the safety system up is expressed by

MTTFrsu = ∫₀^∞ Rrsu(t) dt = 1/A    (13.39)
Similarly, the mean time to failure of the robot with safety system up or down is
expressed by:
MTTFss = ∫₀^∞ Rss(t) dt = (1/A)(1 + λs/C)    (13.40)
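As a sanity check on the state-probability expressions above, the short Python sketch below (an illustration, using the failure-rate values that appear later in Example 13.6) evaluates Equations (13.32) through (13.36) at an arbitrary time and confirms that the five state probabilities sum to one.

    import math

    lam_s, lam_r = 0.0002, 0.0004     # safety system and robot failure rates (failures/h)
    lam_ri, lam_rs = 0.0001, 0.0003   # robot failing with an incident / failing safely (failures/h)

    A = lam_s + lam_r
    B = lam_s + lam_r - lam_rs - lam_ri
    C = lam_ri + lam_rs

    def state_probabilities(t):
        # Equations (13.32)-(13.36): non-repairable safety system case
        p0 = math.exp(-A * t)
        p1 = (lam_s / B) * (math.exp(-C * t) - math.exp(-A * t))
        common = 1 - (A * math.exp(-C * t) - C * math.exp(-A * t)) / B
        p2 = (lam_ri * lam_s / (A * C)) * common
        p3 = (lam_rs * lam_s / (A * C)) * common
        p4 = (lam_r / A) * (1 - math.exp(-A * t))
        return p0, p1, p2, p3, p4

    print(round(sum(state_probabilities(1000.0)), 6))   # 1.0

Because the five states are mutually exclusive and exhaustive, the probabilities must always sum to one, and the check passes for any non-negative t.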
If the failed safety system shown in Figure 13.4 (State 1) can be repaired, the
robot system reliability will improve considerably. Thus, for constant failed safety
system repair rate, , from robot system state 1 to 0, the system of differential
equations for the new scenario is
d P0(t)/dt + A P0(t) = μ P1(t)    (13.41)

d P1(t)/dt + D P1(t) = P0(t) λs    (13.42)

where
D = λri + λrs + μ

d P2(t)/dt = P1(t) λri    (13.43)

d P3(t)/dt = P1(t) λrs    (13.44)

d P4(t)/dt = P0(t) λr    (13.45)
Solving Equations (13.41) through (13.45) with the same initial conditions as before yields

P0(t) = e^(−A t) + λs μ [e^(−A t)/((r1 + A)(r2 + A)) + e^(r1 t)/((r1 + A)(r1 − r2)) + e^(r2 t)/((r2 + A)(r2 − r1))]    (13.46)

where

r1, r2 = [−E ± (E² − 4 F)^(1/2)]/2
E = A + C + μ
F = λri λs + λrs λs + λri λr + λrs λr + μ λr

P1(t) = λs [e^(r1 t) − e^(r2 t)]/(r1 − r2)    (13.47)

P2(t) = [λri λs/(r1 r2)] {1 + [r1 e^(r2 t) − r2 e^(r1 t)]/(r2 − r1)}    (13.48)

P3(t) = [λrs λs/(r1 r2)] {1 + [r1 e^(r2 t) − r2 e^(r1 t)]/(r2 − r1)}    (13.49)

P4(t) = (λr/A) [1 − e^(−A t)] + λs λr μ [1/(r1 r2 A) − e^(−A t)/(A (r1 + A)(r2 + A)) + e^(r1 t)/(r1 (r1 + A)(r1 − r2)) + e^(r2 t)/(r2 (r2 + A)(r2 − r1))]    (13.50)
The reliability of both robot and its associated safety system working normally with
safety system repair facility from Equation (13.46) is
Rrsr(t) = e^(−A t) + λs μ [e^(−A t)/((r1 + A)(r2 + A)) + e^(r1 t)/((r1 + A)(r1 − r2)) + e^(r2 t)/((r2 + A)(r2 − r1))]    (13.51)
The reliability of the robot operating normally with or without the safety system
operating (but having the safety system repair facility) from Equations (13.46) and
(13.47) is
Rssr(t) = e^(−A t) + λs [e^(r1 t) − e^(r2 t)]/(r1 − r2) + λs μ [e^(−A t)/((r1 + A)(r2 + A)) + e^(r1 t)/((r1 + A)(r1 − r2)) + e^(r2 t)/((r2 + A)(r2 − r1))]    (13.52)
The mean time to failure of the robot with repair and with safety system operating
is expressed by
MTTFrsr = ∫₀^∞ Rrsr(t) dt    (13.53)
        = (1/A)(1 + λs μ/F)
Similarly, the mean time to failure of the robot with repair and with or without
the safety system operating is given by
MTTFssr = ∫₀^∞ Rssr(t) dt    (13.54)
        = (1/A)(1 + λs μ/F) + λs/F
Example 13.6
Assume that a robot system is composed of a robot and a safety system/unit and
the operating robot with failed safety unit can either fail with an incident or safely.
The failure rates of robot and safety system/unit are 0.0004 failure per hour and
0.0002 failure per hour, respectively. In addition, the failure rates of the robot failing with an incident or failing safely are 0.0001 failure per hour and 0.0003 failure per hour, respectively, and the failed safety unit repair rate is 0.005 repair per hour. Calculate the following and comment on the end results:
The robot system mean time to failure when the safety system is operating.
More specifically, use Equation (13.39).
The robot system mean time to failure with repair. More specifically, use
Equation (13.53).
Thus, in this example, we have λr = 0.0004 failure per hour, λs = 0.0002 failure per hour, λri = 0.0001 failure per hour, λrs = 0.0003 failure per hour, and μ = 0.005 repair per hour. Substituting the specified data into Equation (13.39) yields
MTTFrsu = 1/A = 1/(λs + λr)
        = 1/[(0.0002) + (0.0004)]
        = 1666.67 h
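The corresponding calculation for the repairable case, requested in the second part of the example, follows directly from Equation (13.53) as reconstructed above; the Python sketch below (an illustration only, so the second figure should be read accordingly) evaluates both parts.

    lam_r, lam_s = 0.0004, 0.0002     # robot and safety unit failure rates (failures/h)
    lam_ri, lam_rs = 0.0001, 0.0003   # robot failing with an incident / failing safely (failures/h)
    mu = 0.005                        # failed safety unit repair rate (repairs/h)

    A = lam_s + lam_r
    F = lam_ri * lam_s + lam_rs * lam_s + lam_ri * lam_r + lam_rs * lam_r + mu * lam_r

    mttf_rsu = 1.0 / A                               # Equation (13.39): no safety system repair
    mttf_rsr = (1.0 / A) * (1.0 + lam_s * mu / F)    # Equation (13.53): with safety system repair

    print(round(mttf_rsu, 2))   # 1666.67 h
    print(round(mttf_rsr, 2))   # about 2410.71 h

In other words, adding a repair facility for the failed safety system raises the robot mean time to failure from roughly 1667 h to roughly 2411 h, consistent with the earlier observation that repairing the failed safety system improves robot system reliability considerably.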
13.4 PROBLEMS
1. Discuss the historical developments in robotics with emphasis on robot
reliability and safety.
2. Discuss the basic hazards associated with robots.
3. State safety considerations in various phases of the robot life cycle.
4. Describe the following robot safeguard methods:
Intelligent systems
Infrared light arrays
Flashing lights
5. What are the important areas for human factors attention in robotic safety?
6. What are the main causes of robot failures?
7. Define the terms robot reliability and robot mean time to failure.
8. Assume that the times to failure of a robot are Weibull distributed. Obtain
an expression for the robot reliability.
9. What is the important difference between the following two terms:
Mean time to robot failure
Mean time to robot-related problems
10. A robot system can fail either due to human error or other failures. The
estimated values of human error and other failure rates are 0.0007 error
per hour and 0.004 failure per hour. The repair rate of the failed robot
system either due to human error or other failures is 0.01 repair per hour.
Calculate the steady state probability of the robot system being in the
failed state due to human error.
13.5 REFERENCES
1. Heer, E., Robots in modern industry, in Recent Advances in Robotics, Beni, G. and
Hackwood, S., Eds., John Wiley & Sons, New York, 1985, pp. 11-36.
2. Dhillon, B.S., Robot Reliability and Safety, Springer-Verlag, New York, 1991.
3. Zeldman, M.I., What Every Engineer Should Know About Robots, Marcel Dekker,
New York, 1984.
4. Worldwide Robotics Survey and Directory, Robot Institute of America, P.O. Box
1366, Dearborn, MI, 1983.
5. Rudall, B.H., Automation and robotics worldwide: reports and surveys, Robotica, 14,
243-251, 1996.
6. Parsad, H.P., Safety standards, in Handbook of Industrial Robots, Nof, S.Y., Ed., John
Wiley & Sons, New York, 1988.
7. An Interpretation of the Technical Guidance on Safety Standards in the Use of
Industrial Robots, Japanese Industrial Safety and Health Association, Tokyo, 1985.
8. Bonney, J.F. and Yong, J.F., Eds., Robot Safety, Springer-Verlag, New York, 1985.
9. Industrial Robots and Robot Systems: Safety Requirements, American National Standards Institute, New York, 1986.
10. Dhillon, B.S., On robot reliability and safety: Bibliography, Microelectronics and
Reliability, 27, 105-118, 1987.
11. Dhillon, B.S., Robot accidents, Proc. First Beijing Int. Conf. Reliability, Maintainability, and Safety, International Academic Publishers, Beijing, 1992.
12. Altamuro, V.M., Working safely with the iron collar worker, National Safety News,
38-40, 1983.
13. Nicolaisen, P., Safety problems related to robots, Robotics, 3, 205-211, 1987.
14. Study on Accidents Involving Industrial Robots, prepared by the Japanese Ministry
of Labour, Tokyo, 1982. NTIS Report No. PB 83239822, available from the National
Technical Information Service (NTIS), Springfield, VA.
15. Request for Assistance in Preventing the Injury of Workers by Robots, prepared by
the National Institute for Occupational Safety and Health, Cincinnati, OH, December
1984, NTIS Report No. PB 85236818, available from the National Technical Information Service (NTIS), Springfield, VA.
16. Ziskovsky, J.P., Working safely with industrial robots, Plant Eng., May, 81-85, 1984.
17. Ziskovsky, J.P., Risk analysis and the R3 factor, Proc. Robots 8 Conf., 2, 15.9-15.21,
1984.
18. Bellino, J.P. and Meagher, J., Design for safeguarding, Robots East Sem., Boston,
MA, October 9-11, 1985, pp. 24-37.
19. Russell, J.W., Robot safety considerations: A checklist, Professional Safety, December
1983, pp. 36-37.
20. Nicolaisen, P., Ways of improving industrial safety for the programming of industrial
robots, Proc. 3rd Int. Conf. Human Factors in Manufacturing, November 1986,
pp. 263-276.
21. Jiang, B.C., Robot Safety: Users' Guidelines, in Trends in Ergonomics/Human Factors
III, Karwowski, W., Ed., Elsevier, Amsterdam, 1986, pp. 1041-1049.
22. Addison, J.H., Robotic Safety Systems and Methods: Savannah River Site, Report
No. DPST-84-907 (DE 35-008261), December 1984, issued by E.I. du Pont de
Nemours & Co., Savannah River Laboratory, Aiken, SC.
23. Lembke, J.R., An Evaluation of Robot Perimeter Control Devices, Topical Report No.
705349, Report No. DBX-613-3031, Bendix (Kansas City Division), January 1984.
24. Zimmers, E.W., Human factors aspects of robotic safety, Proc. Robotic Industries
Assoc. (RIA) Robot Safety Sem., Chicago, April 24, 1986, pp. 1-8.
25. Engelberger, J.F., Robotics in Practice, Kogan Page, London, 1980.
26. Dhillon, B.S., On robot reliability and safety, Microelectronics and Reliability, 27,
105-118, 1987.
27. Dhillon, B.S., Robot Reliability and Safety, Springer-Verlag, New York, 1991.
28. Sato, K., Case study of maintenance of spot-welding robots, Plant Maintenance, 14,
28, 1982.
29. Sugimoto, N. and Kawaguchi, K., Fault tree analysis of hazards created by robots,
Proc. 13th Int. Symp. Industrial Robots, 1983, pp. 9.13-9.28.
30. Khodabandehloo, K., Duggan, F., and Husband, T.F., Reliability of industrial robots:
A safety viewpoint, Proc. 7th British Robot Assoc. Annu. Conf., 233-242, 1984.
31. Khodabandehloo, K., Duggan, F., and Husband, T.F., Reliability assessment of industrial robots, Proc. 14th Int. Symp. Industrial Robots, 209-220, 1984.
32. Dhillon, B.S., Reliability Engineering in Systems Design and Operation, Van Nostrand Reinhold Company, New York, 1983.
33. MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, U.S. Department
of Defense, Washington, D.C.
34. Dhillon, B.S. and Yang, N., Reliability analysis of a repairable robot system, J. Quality
Maintenance Eng., 2, 30-37, 1996.
14
Medical Device
Reliability
14.1 INTRODUCTION
Even though the history of Reliability Engineering goes back to World War II, the
serious thinking of the applications of reliability engineering concepts to medical
devices or equipment is not that old. Nowadays, there are encouraging signs that
both the engineers and manufacturers in the medical field are breaking away from
their conventional outlook. There could be various reasons for this new phenomenon;
government requirements, cost effectiveness, public demand, advances made in other
areas, etc. For example, because of the dependence of human life on the performance and reliability of the myriad of medical devices, in the U.S. the two basic important sources of regulation and control are the government (e.g., the Food and Drug Administration (FDA)) and the legal system (e.g., product liability) [1]. Basically, both of these factors have played an instrumental role in driving the medical device industry to implement more stringent controls on its products.
Therefore, today an engineer involved in the design of various medical products
must address the issue of the degree of reliability desired as a function of the specific
application. This is very crucial because a consequence of a failure may vary from
one application area to another. For example, harm to the patient could result from
a failure in a defibrillator; however, on the other hand, failure of a monitoring device
may be tolerated without directly endangering the patient.
The latter part of the 1960s may be regarded as the real beginning of the medical
device reliability field as this period witnessed the start of various publications on
the subject [26]. In 1980 an article entitled Bibliography of Literature on Medical
Equipment Reliability provided a comprehensive list of publications on the
subject [7]. In 1983, a book on reliability engineering devoted a chapter to medical
equipment reliability [8]. Since the early 1980s, many more publications on the field
have appeared, and this chapter discusses various aspects of medical device reliability.
successful human implant was performed in Buffalo in 1960 [9]. Today, total artificial heart and assist devices are under development and they are expected to play
an important role in the improvement of health care throughout the world. Nonetheless, some of the facts and figures directly or indirectly associated with the
medical devices are as follows:
There are approximately 5000 institutional, 600 associate, and 40,000 personal members of the American Hospital Association (AHA). The institutional members include short
and long term hospitals, headquarters of health care systems, hospital
affiliated educational programs, etc. and the associate members are made
up of organizations such as commercial firms, consultants, and suppliers.
The individuals working in the health care field, health care executive
assistants, full-time students studying hospital administration, etc. are
classified under the personal member category.
In 1974 dollars, the total assets of the AHA registered hospitals were in
the order of $52 billion [10].
Today, modern hospitals use over 5000 different medical devices ranging
from tongue depressors to sophisticated devices such as pacemakers.
Medical devices account for an approximately $120 billion worldwide market [11] and in 1984, medical devices using electronics accounted for over $11 billion in annual worldwide sales [12-14].
In 1997, there were 10,420 registered medical device manufacturers in
the U.S. [15].
In the middle of the 1990s, 93% of medical devices had markets worth
less than $150 million [16].
The National Fire Protection Association (NFPA) committee on hospitals reported 1200 deaths per year due to faulty instrumentation [17, 18].
In 1990, an FDA study reported that approximately 44% of the quality
related problems that resulted in voluntary medical device recall during
the period from October 1983 to September 1989 were attributable to
deficiencies/errors that could have been eradicated by effective design
controls [19].
During the early half of the 1990s, over a period of five years, the top 10
orthopedic companies in the U.S. increased their regulatory affairs staff
by 39% [20].
Operator errors account for well over 50% of all technical medical equipment problems [21].
According to the findings of the Philadelphia Emergency Care Research
Institute, from 4 to 6% of hospital products were sufficiently dangerous
to warrant immediate correction. These findings were based on testing of
a sample of 15,000 different hospital products [21].
In 1990, the U.S. Congress passed the Safe Medical Device Act (SMDA)
that empowered the FDA to implement the Preproduction Quality Assurance program. This program requires/encourages medical device manufacturers to address deficiencies in product design contributing to failures.
In delivering the judgment on the case, Justice John Paul Stevens wrote that Medtronic's argument "is not only unpersuasive, it is implausible" [25]. The Supreme Court's decision has put additional pressure on manufacturers to produce safe and
reliable medical devices.
The number of recalls associated with each of these categories was 40, 13, 54, 10,
17, 40, 27, 25, and 4, respectively. It is interesting to note that faulty product design,
product contamination, mislabeling, and defects in material selection and manufacturing accounted for 70% of the medical device recalls. The components of the faulty
product design category were premature failure, alarm defects, potential for malfunction, electrical interference, failure to perform as required, and potential for leakage
of fluids into electrical components. Four subcategories of the product contamination
were defective package seals, non-sterility, other package defects, and other faults.
The problem areas of the mislabeling category were incomplete labeling, inadequate
labeling, disparity between label and product, and misleading or inaccurate labeling.
There were five specific problems in the material selection and manufacturing category: inappropriate materials, manufacturing defects, separation of bonded components, actual or potential breakage/cracking, and material deterioration.
3. design review
4. reliability assessment
5. parts/materials quality control
6. software quality control
7. design transfer
8. labeling
9. certification
10. test instrumentation
11. manpower
12. quality monitoring subsequent to the design phase.
14.5.1 ORGANIZATION
The formal documentation should contain the organizational elements and authorities
desirable to develop the preproduction quality assurance (PQA) program, to execute
program requirements, and to achieve program goals. Also, it is essential to formally
assign and document the responsibility for implementing the entire program and
each program element. Audit the PQA program periodically and update it as experience is gained.
14.5.2 SPECIFICATIONS
After establishing desired device design characteristics such as physical, chemical,
performance, etc., translate them into written design specifications. Specifications
should address performance characteristics, as applicable, such as reliability, safety,
stability, and precision, important to meet the devices end objective. It is important
to remember that these preliminary specifications provide the vehicle with which to
develop, control, and evaluate the device design.
Evaluate and document changes made to the specifications during research and
development, in particular, ensure that the products safety or effectiveness is not
compromised by these changes. All in all, for the sake of effectiveness of specifications, the qualified individuals from areas such as reliability, quality assurance,
manufacturing, and marketing should be asked to review the specification document.
14.5.8 LABELING
Labeling includes items such as display labels, manuals, inserts, software for cathode
ray tube (CRT) displays, panels, charts, and recommended test and calibration
protocols. Design reviews are a useful vehicle to ensure that the labeling is in
compliance with laws and regulations as applicable and the directions for the device
usage can be understood easily by the user community.
14.5.9 CERTIFICATION
After the initial production units pass preproduction qualification testing, it is essential to perform a formal technical review to assure adequacy of the design,
production, and quality assurance approaches/procedures. In addition, the review
should incorporate a determination of items such as those listed below.
Overall adequacy of quality assurance plan.
Resolution of any discrepancy between the standards/procedures
employed to construct the design during the research and development
phase and those identified for the production phase.
Adequacy of specifications.
Suitability of test approaches employed for evaluating compliance with
the approved specifications.
Adequacy of specification change control program.
Resolution of any discrepancy between the approved specifications for
the device and the end produced device.
14.5.11 MANPOWER
Only qualified and competent individuals must be allowed to perform device associated design activities such as design review, analysis, and testing.
During the treatment of an infant patient with oxygen, a physician set the
flow control knob between 1 and 2 liters per minute without realizing the
fact that the scale numbers represented discrete, instead of continuous, settings. As a result of the physician's action, the patient became hypoxic [30].
capability, ambiguous part arrangements, poor design for easy cleaning, difficulty
in reaching or manipulating screws/parts, inadequate labeling, coding, or numbering
of parts, difficulty in locating parts visually or by touch, and requirements for
difficult-to-find tools [30].
Complaints
Observations
Installation Problems
Incidents
Some examples of the problems under the complaint group are poorly located
or labeled controls, illogical and confusing device operation, difficult to read or
understand displays, annoying device alarms, and difficult to hear or distinguish
alarms. The items belonging to the observation group include slow and arduous training, refusal of the staff to use the device, and the observation that only a few individuals seem to be able to use the device correctly. Some of the installation related problems are
accessories installed incorrectly, frequently parts become detached, involved individuals find installation of accessories difficult, confusing, or overly time-consuming,
and alarms and batteries fail often. One important element of the incident category
is that highly competent people are involved in accidents or near-misses. More
specifically, this could be a warning.
It is important to consider the means of assessing usability of medical devices
prior to their procurement, particularly if they are to be used for life-sustaining or
life-supporting purposes. Some of the associated steps include the following:
Determine if the manufacturer has performed human factors/usability
testing of the device under consideration.
Review published evaluation data concerning the device under consideration.
Determine from staff and other sources the human factors performance of predecessor device models produced by the same manufacturer.
Negotiate a trial period prior to the actual procurement of the device under
consideration.
Make contact with facilities that have already used the device under
consideration.
AUTOMATED SOFTWARE TESTING FOR IMPROVING MEDICAL DEVICE SAFETY
Approaches such as functional testing, error testing, free-form testing, safety testing, and white-box testing can be improved through automation.
Servicing/repair reports
User/distributor records
Published/unpublished literature sources
The manufacturers product complaint-handling mechanism
In-house research/testing evaluation/etc. records
Legal records
Sales representative contacts
Technical service customer contacts.
TABLE 14.1
Aerospace and Medical Industries Approach to Selected Reliability Factors

No.  Reliability factor         Aerospace industry's approach
1    Stress screening           MIL-STD-2164 [48] is used for control.
2    Part quality               MIL-STD-883 [49] is used for control.
3    Predictions                MIL-HDBK-217 [50] is used for control.
4    Design margins             MIL-HDBK-251 [52] is used for control.
5    Complexity                 MIL-HDBK-338 [53] is used for control.
6    Testing and verification   MIL-STD-781 [54] and MIL-HDBK-781 [55] are used for control.
Other Professionals
Keep in the back of your mind that probably the largest single expense
in a business organization is the cost of failures. Such failures could
be associated with people, equipment, business systems, and so on.
The cost of business can be decreased quite significantly by reducing
such failures.
The manufacturer is responsible for device reliability during the design
and manufacturing phases, but during field use the user must use the
device according to design conditions. More specifically, both parties
are expected to accept their responsibilities for the total success.
Draw a comparison between human body failures and medical device failures. Both of them need positive actions from reliability engineers and doctors to improve device reliability and extend human life, respectively.
Recognize that failures are the cause of poor medical device reliability
and positive thinking and actions can improve device reliability.
Remember that the applications of reliability principles have successfully improved the reliability of aerospace systems and such applications to medical devices will generate dividends of a similar nature.
14.12 PROBLEMS
1. Make a comparison of medical device reliability with aerospace system
reliability.
2. Write an essay on medical device failures.
3. Discuss at least five typical important reasons for the medical device
recalls in the U.S.
4. State measures proposed by the FDA to produce reliable devices.
5. Give at least five examples of human error in the health care system.
6. Discuss human factors with respect to medical devices.
7. What are the two software classifications used by the FDA? Describe both
of them in detail.
8. Discuss the following types of manual software testing:
Functional testing
Safety testing
Error testing
9. Compare manual software testing with automated software testing.
10. Discuss the possible sources of MDR-type reportable incidents or events.
14.13 REFERENCES
1. Bell, D.D., Contrasting the medical-device and aerospace-industries approach to
reliability, Proc. Annu. Reliability Maintainability Symp., 125-127, 1995.
2. Johnson, J.P., Reliability of ECG instrumentation in a hospital, Proc. Annu. Symp.
Reliability, 314-318, 1967.
3. Crump, J.F., Safety and reliability in medical electronics, Proc. Annu. Symp. Reliability, 320-322, 1969.
4. Gechman, R., Tiny flaws in medical design can kill, Hosp. Top., 46, 23-24, 1968.
5. Meyer, J.L., Some instrument induced errors in the electrocardiogram, J. Am. Med.
Assoc., 201, 351-358, 1967.
6. Taylor, E.F., The effect of medical test instrument reliability on patient risks, Proc.
Annu. Symp. Reliability, 328-330, 1969.
7. Dhillon, B.S., Bibliography of literature on medical equipment reliability, Microelectronics and Reliability, 20, 737-742, 1980.
8. Dhillon, B.S., Reliability Engineering in Systems Design and Operation, Van Nostrand Reinhold Company, New York, 1983.
9. Eberhard, D.P., Qualification of high reliability medical grade batteries, Proc. Annu.
Reliability Maintainability Symp., 356-362, 1989.
10. Fairhurst, G.F. and Murphy, K.L., Help wanted, Proc. Annu. Reliability Maintainability Symp., 103-106, 1976.
11. Murray, K., Canada's medical device industry faces cost pressures, regulatory reform,
Medical Device and Diagnostic Industry Magazine, 19(8), 30-39, 1997.
12. Bassen, H., Sillberberg, J., Houston, F., and Knight, W., Computerized medical
devices: Trends, problems, and safety, IEEE Aerospace and Electronic Systems (AES)
Magazine, September 1986, pp. 20-24.
13. Dickson, C., World medical electronics market: An overview, Medical Devices and
Diagnostic Products, May 1984, pp. 53-58.
14. World Medical Electronics Markets 1985 Yearbook, Market Intelligence Research
Company, Palo Alto, CA, March 1985.
15. Allen, D., California home to almost one-fifth of U.S. medical device industry,
Medical Device and Diagnostic Industry Magazine, 19(10), 64-67, 1997.
16. Bethune, J., The cost effectiveness bugaboo, Medical Device and Diagnostic Industry
Magazine, 19(4), 12-15, 1997.
17. Walter, C.W., Electronic News, January 27, 1969.
18. Micco, L.A., Motivation for the biomedical instrument manufacturer, Proc. Annu.
Reliability Maintainability Symp., 242-244, 1972.
19. Schwartz, A.P., A call for real added value, Medical Industry Executive, February/March, 1994, pp. 5-9.
20. Allen, R.C., FDA and the cost of health care, Medical Device and Diagnostic Industry
Magazine, 18(7), 28-35, 1996.
21. Dhillon, B.S., Reliability technology in health care systems, Proc. IASTED Int. Symp.
Comp. Adv. Technol. Med. Health Care Bioeng., 84-87, 1990.
22. O'Leary, D.J., International standards: Their new role in a global economy, Proc.
Annu. Reliability Maintainability Symp., 17-23, 1996.
23. Hooten, W.F., A brief history of FDA good manufacturing practices, Medical Device
and Diagnostic Industry Magazine, 18(5), 96, 1996.
24. Cady, W.W. and Iampietro, D., Medical Device Reporting, Medical Device and
Diagnostic Industry Magazine, 18(5), 58-67, 1996.
25. Bethune, J., Ed., On product liability: Stupidity and waste abounding, Medical Device
and Diagnostic Industry Magazine, 18(8), 8-11, 1996.
26. MIL-STD-1629A, Procedures for Performing a Failure Mode, Effects, and Criticality
Analysis, Department of Defense, Washington, D.C.
27. ANSI/IEEE-STD-730-1984, IEEE Standard for Software Quality Assurance Plans,
American National Standards Institute (ANSI), New York, 1984.
28. Bogner, M.S., Ed., Human Error in Medicine, Lawrence Erlbaum Associates, Hillsdale, NJ, 1994.
29. Maddox, M.E., Designing medical devices to minimize human error, Medical Device
and Diagnostic Industry Magazine, 19(5), 166-180, 1997.
30. Sawyer, D., Do It By Design: Introduction to Human Factors in Medical Devices,
Center for Devices and Radiological Health (CDRH), Food and Drug Administration
(FDA), Washington, D.C., 1996.
31. Nobel, J.L., Medical device failures and adverse effects, Pediat. Emerg. Care, 7, 120-123, 1991.
32. Bogner, M.S., Medical devices and human error, in Human Performance in Automated
Systems: Current Research and Trends, Mouloua, M. and Parasuraman, R., Eds.,
Lawrence Erlbaum Associates, Hillsdale, NJ, 1994, pp. 64-67.
33. Casey, S., Set Phasers on Stun: and Other True Tales of Design Technology and
Human Error, Aegean, Inc., Santa Barbara, CA, 1993.
34. McDonald, J.S. and Peterson, S., Lethal errors in anaesthesiology, Anesthesiol, 63,
A 497, 1985.
35. Wood, B.J. and Ermes, J.W., Applying hazard analysis to medical devices, Part II:
Detailed hazard analysis, Medical Device and Diagnostic Industry Magazine, 15(3),
58-64, 1993.
36. Weide, P., Improving medical device safety with automated software testing, Medical
Device and Diagnostic Industry Magazine, 16(8), 66-79, 1994.
37. Levkoff, B., Increasing safety in medical device software, Medical Device and Diagnostic Industry Magazine, 18(9), 92-101, 1996.
38. Federal Food, Drug, and Cosmetic Act, as Amended, Sec. 201 (h), U.S. Government
Printing Office, Washington, D.C., 1993.
39. Onel, S., Draft revision of FDAs medical device software policy raises warning flags,
Medical Device and Diagnostic Industry Magazine, 19(10), 82-92, 1997.
40. Mojdehbakhsh, R., Tsai, W., and Kirani, S., Retrofitting software safety in an implantable medical device, Trans. Software, 11, 41-50, 1994.
41. Heydrick, L. and Jones, K.A., Applying reliability engineering during product development, Medical Device and Diagnostic Industry Magazine, 18(4), 80-84, 1996.
42. IEEE-STD-1228, Standard for Software Safety Plans, Institute of Electrical and
Electronic Engineers (IEEE), New York, 1994.
43. Cady, W.W. and Iampietro, D., Medical device reporting, Medical Device and Diagnostic Industry Magazine, 18(5), 58-67, 1996.
44. Thibeault, A., Documenting a failure investigation, Medical Device and Diagnostic
Industry Magazine, 19(10), 68-74, 1997.
45. Rose, H.B., A small instrument manufacturers experience with medical instrument
reliability, Proc. Annu. Reliability Maintainability Symp., 251-254, 1972.
46. Taylor, E.F., The reliability engineer in the health care system, Proc. Reliability and
Maintainability Symp., 245-248, 1972.
47. Bell, D.D., Contrasting the medical-device and aerospace-industries approach to
reliability, Proc. Annu. Reliability Maintainability Symp., 125-127, 1995.
48. MIL-STD-2164, Environment Stress Screening Process for Electronic Equipment,
Department of Defense, Washington, D.C.
49. MIL-STD-883, Test Methods and Procedures for Microelectronics, Department of
Defense, Washington, D.C.
50. MIL-HDBK-217, Reliability Prediction of Electronic Equipment, Department of
Defense, Washington, D.C.
51. Reliability Prediction Procedure for Electronic Equipment, Report No. TR-NWT000332, issue 4, Bell Communication Research, New Jersey, 1992.
15
Design Maintainability
and Reliability
Centered Maintenance
15.1 INTRODUCTION
Maintainability is a design and installation characteristic that imparts to an item a greater inherent ability to be maintained, and consequently results in factors such as a decrease in required maintenance man-hours, tools, skill levels, and facilities.
The history of maintainability can be traced back to 1901 in the Army Signal
Corps Contract with the Wright brothers to develop an airplane. This document
contained a clause that the aircraft under consideration be simple to operate and
maintain [1]. Various studies conducted during the period between World War II
and the early 1950s by the U.S. Department of Defense indicated startling results
concerning the state of reliability and maintainability of equipment used by the three
services [2, 3]. For example, an Army study revealed that approximately two-thirds to three-fourths of its equipment was either under repair or out of service.
In 1956, a series of articles appeared in Machine Design covering areas such as
designing electronic equipment for maintainability, recommendations for designing
maintenance access in electronic equipment, design recommendations for test points,
and factors to consider in designing displays [4]. In 1959, the U.S. Department of
Defense released a document (i.e., MIL-M-26512) on maintainability specification.
In fact, an appendix to this document contained an approach for planning a maintainability demonstration test.
In 1960, the first commercially available book entitled Electronic Maintainability
appeared [5]. Since the 1960s, a large number of publications on various aspects of maintainability have appeared [6, 7].
Reliability centered maintenance (RCM) is a systematic methodology employed
to highlight the preventive maintenance tasks necessary to realize the inherent
reliability of an item at the lowest resource expenditure. As per the published sources,
the original development of the RCM concept could be traced back to the late 1960s
in the commercial aircraft industry. For example, in 1968 a document (MSG 1)
entitled Maintenance Evaluation and Program Development was prepared by the
U.S. Air Transport Association (ATA) for use with the Boeing 747 aircraft [8]. At a
later date, MSG 1 was revised to handle two other wide-body aircraft: the L-1011 and the DC-10. The revised version became known as MSG 2 [9].
In 1974, the U.S. Department of Defense commissioned United Airlines to
prepare a document on processes used by the civil aviation industrial sector to
develop maintenance programs for aircraft [10]. The resulting document was entitled
Reliability Centered Maintenance by United Airlines.
MSG 2 was revised by the ATA in 1980 to include maintenance programs for two more aircraft: the Boeing 757 and 767. The revised version was named MSG 3 [11].
In Europe, MSG 3 became known as European MSG 3 because it included
maintenance programs for two more aircraft: A-300 and Concorde [12]. In the 1970s,
U.S. forces were attracted by the RCM methodology and in 1985 the military
published two documents entitled Guide to Reliability Centered Maintenance [13]
and Reliability Centered Maintenance for Aircraft, Engines, and Equipment [14].
Today, RCM methodology is actively being practiced in many different parts of the
world.
This chapter describes design maintainability and RCM.
RELIABILITY
This is a design characteristic that results in the durability of the system/equipment in carrying out its designated function under stated conditions for a stated time period. It is
accomplished through various actions including controlling processes, selecting
optimum engineering principles, testing, and adequate component sizing. Nonetheless, there are many specific general principles of reliability: design for simplicity,
use fewer parts to perform multiple functions, design to minimize the
occurrence of failures, maximize the use of standard parts, provide fail safe designs,
use parts with proven reliability, minimize stress on parts, provide satisfactory safety
factors between strength and peak stress values, provide redundancy when required,
and provide for simple periodic adjustment of parts subject to wear [18].
MAINTAINABILITY
This is a built-in design and installation characteristic that provides the end product/equipment an inherent ability to be maintained, thus ultimately resulting in lower
maintenance cost, required skill levels, man-hours required, required tools and equipment, and improved mission availability.
Some of the specific general principles of maintainability are reduce mean time
to repair (MTTR), lower life cycle maintenance costs, provide for maximum interchangeability, consider benefits of modular replacement vs. part repair or throwaway
design, reduce or eliminate altogether the need for maintenance, establish the extent
of preventive maintenance to be performed, reduce the amount, frequency, and
complexity of required maintenance tasks, and lower amount of supply supports
required [18].
Hazards and accidents could be the result of unsatisfactory attention given to safety
during the design phase. Nonetheless, the human safety guidelines include installing
appropriate fail-safe devices, carefully studying the potential sources of injury by
electric shock, fitting all access openings with appropriate fillets, providing appropriate emergency doors, installing items requiring maintenance such that hazard in
accessing them is minimized, and providing adequate amount of tracks, guides, and
stops to facilitate equipment handling.
Interchangeability
It means that a given component can be replaced by any similar part, and the
replacing part can carry out the required functions of the replaced part effectively.
There are various factors that must be considered carefully in determining needs for
interchangeability including cost effectiveness of manufacture and inspection, and
field conditions. Maximum interchangeability can only be assured if the design
professionals consider factors such as providing adequate level of information in the
task instructions, physical similarities (include shape, size, mounting, etc.), and nonexistence of physical interchangeability in places where functional interchangeability
is not expected.
Equipment Packaging
This is an important maintainability factor and is basically concerned with the manner in which equipment is packaged: ease of parts removal, item layout, access, and so on. In order to achieve effective equipment packaging, careful attention must be given to factors such as accessibility requirements, modularization needs, manufacturing requirements, reliability and safety factors, environmental factors, and standardization needs.
In particular, modularization and accessibility are very important and are described in more detail below. Modularization is the division of a product into distinct physical and functional units to facilitate removal and replacement. The advantages of modular construction include reduction in maintenance time and cost, a product that is usually easier to maintain, shorter design times because of simplified design, and the need for relatively low skill levels and fewer maintenance tools.
On the other hand, accessibility is the relative ease with which an item can be reached for actions such as repair, inspection, or replacement. Accessibility is affected by factors such as the types of tools required, access usage frequency, type of task to be performed, hazards involved with respect to access usage, and the distance to be reached.
Human Factors
As maintainability depends on both the maintainer and the operator, human factors are very important. For example, one aircraft maintenance study reported that over a period of 15 months, human error resulted in 475 accidents and incidents in flight and ground operations [1, 20]. Furthermore, most of the accidents happened shortly after periodic inspections, and 95 aircraft were damaged or destroyed with a loss of 14 lives. In human factors, environment plays an important role since it can vary quite considerably from one application to another. Environment can be classified into three categories: physical, operational, and human. The physical category includes factors such as temperature, noise, vibration, radiation, pressure, and dust. The components of the operational category are work duration, illumination, maintenance workspace arrangement, acoustics, ventilation, etc. The factors belonging to the human category include physical, psychological, and physiological limitations.
Mean Time to Repair (MTTR)
The system mean time to repair is expressed by

MTTR = \frac{\sum_{j=1}^{k} \lambda_j T_j}{\sum_{j=1}^{k} \lambda_j}    (15.1)

where
T_j is the corrective maintenance or repair time required to repair unit j; for j = 1, 2, 3, ..., k.
k is the total number of units.
λ_j is the constant failure rate of unit j; for j = 1, 2, 3, ..., k.
Example 15.1
An engineering system is made up of six replaceable subsystems 1, 2, 3, 4, 5, and 6 with constant failure rates λ1 = 0.0002 failures/h, λ2 = 0.0002 failures/h, λ3 = 0.0001 failures/h, λ4 = 0.0004 failures/h, λ5 = 0.0005 failures/h, and λ6 = 0.0006 failures/h, respectively. Corrective maintenance times associated with subsystems 1, 2, 3, 4, 5, and 6 are T1 = 1 h, T2 = 1 h, T3 = 0.5 h, T4 = 2 h, T5 = 3 h, and T6 = 3.5 h, respectively. Estimate the system MTTR.
Inserting the given values into Equation (15.1) yields

MTTR = [(0.0002)(1) + (0.0002)(1) + (0.0001)(0.5) + (0.0004)(2) + (0.0005)(3) + (0.0006)(3.5)] / (0.0002 + 0.0002 + 0.0001 + 0.0004 + 0.0005 + 0.0006)
     = 2.425 h
Thus, the system mean time to repair is 2.425 h.
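For readers who wish to verify such calculations by machine, the following short Python sketch evaluates Equation (15.1) for the data of Example 15.1; the variable names are illustrative only.

```python
# Sketch: system mean time to repair from Equation (15.1),
# using the failure rates and repair times of Example 15.1.
failure_rates = [0.0002, 0.0002, 0.0001, 0.0004, 0.0005, 0.0006]  # failures/h
repair_times = [1.0, 1.0, 0.5, 2.0, 3.0, 3.5]                     # h

mttr = (sum(l * t for l, t in zip(failure_rates, repair_times))
        / sum(failure_rates))
print(f"System MTTR = {mttr:.3f} h")  # prints 2.425 h
```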
Median Corrective Maintenance Time
It is the time within which 50% of all corrective maintenance actions can be
performed. The measure is dependent upon the probability density function describing the times to repair (e.g., exponential, lognormal). Thus, the median corrective
maintenance time (MCMT) for the exponential distribution is expressed by
MCMT = 0.69/μ    (15.2)

where
μ is the constant repair rate, thus the reciprocal of the MTTR.
Similarly, the MCMT for the lognormal distribution is given by

MCMT = MTTR / e^{σ²/2}    (15.3)

where
σ² is the variance around the mean value of the natural logarithm of repair times.
Maximum Corrective Maintenance Time

Exponential
The maximum corrective maintenance time is expressed by

Tmax = C (MTTR)    (15.4)

where
Tmax is the maximum corrective maintenance time.
C is equal to 2.312 or 3 for the 90th and 95th percentiles, respectively.
Normal
The maximum corrective maintenance time is expressed by
Tmax = MTTR + m σ_n    (15.5)

where
m is equal to 1.28 or 1.65 for the 90th and 95th percentiles, respectively.
σ_n is the standard deviation of the normally distributed maintenance time.
Lognormal
The maximum corrective maintenance time is given by
Tmax = antilog (t_a + m σ_l)    (15.6)

where
t_a is the average of the logarithms of the repair times.
σ_l is the standard deviation of the logarithms of the repair times.
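As an illustration of Equations (15.5) and (15.6), the short Python sketch below computes the maximum corrective maintenance time for both the normal and the lognormal cases; the repair-time figures used are hypothetical, and natural logarithms are assumed for the lognormal case.

```python
import math

def t_max_normal(mttr, sigma_n, m=1.65):
    # Equation (15.5); m = 1.28 (90th percentile) or 1.65 (95th percentile)
    return mttr + m * sigma_n

def t_max_lognormal(repair_times, m=1.65):
    # Equation (15.6): Tmax = antilog(ta + m * sigma_l), using natural logs
    logs = [math.log(t) for t in repair_times]
    k = len(logs)
    ta = sum(logs) / k
    sigma_l = math.sqrt(sum((x - ta) ** 2 for x in logs) / (k - 1))
    return math.exp(ta + m * sigma_l)

print(t_max_normal(mttr=2.7, sigma_n=1.1))          # hypothetical normal case
print(t_max_lognormal([1.5, 2.0, 2.5, 3.0, 4.5]))   # hypothetical repair times (h)
```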
Mean Preventive Maintenance Time
Various preventive maintenance related activities such as inspections, calibrations, and tuning are performed to keep the item or equipment at a specified performance level. The mean preventive maintenance time is expressed by

MPMT = \frac{\sum_{j=1}^{n} f_{pj} T_{pj}}{\sum_{j=1}^{n} f_{pj}}    (15.7)

where
MPMT is the mean preventive maintenance time.
n is the total number of preventive maintenance tasks.
f_pj is the frequency of preventive maintenance task j.
T_pj is the elapsed time of preventive maintenance task j.
Maintainability Function
The maintainability function expresses the probability that a repair, beginning at time t = 0, will be accomplished within time t. It is defined by

M(t) = \int_{0}^{t} f_{rp}(t)\, dt    (15.8)

where
t is the time.
M(t) is the maintainability function.
f_rp(t) is the probability density function of the repair times.

Maintainability functions for exponential, lognormal, and normal repair time distributions are obtained below.
Exponential
The probability density function representing corrective maintenance times is defined by

f_rp(t) = \mu e^{-\mu t}    (15.9)

where
t is the variable repair time.
μ is the constant repair rate or the reciprocal of the MTTR.

By substituting Equation (15.9) into Equation (15.8), we get

M(t) = \int_{0}^{t} \mu e^{-\mu t}\, dt = 1 - e^{-\mu t}    (15.10)

Since μ = 1/MTTR, Equation (15.10) becomes

M(t) = 1 - e^{-t/MTTR}    (15.11)
Example 15.2
After performing analysis of the repair actions associated with an electric generator, it is established that the generator's repair times are exponentially distributed with a mean value of 4 h. Calculate the probability of accomplishing a repair in 5 h.
By inserting the given data into Equation (15.11), we get

M(5) = 1 - e^{-5/4} = 0.7135
It means that there is an approximately 72% chance that the repair will be
completed within 5 h.
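The exponential maintainability function of Equation (15.11) is easily evaluated in software. The following Python sketch reproduces the result of Example 15.2; the function name is illustrative.

```python
import math

def maintainability_exponential(t, mttr):
    # Equation (15.11): probability that a repair is completed within time t
    return 1.0 - math.exp(-t / mttr)

print(round(maintainability_exponential(5, 4), 4))  # Example 15.2: 0.7135
```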
Lognormal
This is a widely used probability distribution in maintainability work and its probability density function representing corrective maintenance times is given by

f_rp(t) = \frac{1}{(t - \gamma)\,\sigma\sqrt{2\pi}} \exp\!\left\{-\frac{1}{2}\left[\frac{\ln(t - \gamma) - \theta}{\sigma}\right]^{2}\right\}    (15.12)

where
γ is a constant denoting the shortest time below which no maintenance activity can be performed.
θ is the mean of the natural logarithms of the maintenance times.
σ is the standard deviation with which the natural logarithms of the maintenance times are spread around the mean θ.

The following relationship defines the mean:

\theta = \frac{\ln t_1 + \ln t_2 + \ln t_3 + \cdots + \ln t_k}{k}    (15.13)

where
t_j is maintenance time j; for j = 1, 2, 3, ..., k.
k is the number of maintenance times.

The standard deviation, σ, is expressed by the following relationship:

\sigma = \left[\frac{\sum_{j=1}^{k} (\ln t_j - \theta)^2}{k - 1}\right]^{1/2}    (15.14)

By substituting Equation (15.12) into Equation (15.8), the maintainability function for lognormally distributed repair times becomes

M(t) = \int_{\gamma}^{t} f_{rp}(t)\, dt = \int_{\gamma}^{t} \frac{1}{(t - \gamma)\,\sigma\sqrt{2\pi}} \exp\!\left\{-\frac{1}{2}\left[\frac{\ln(t - \gamma) - \theta}{\sigma}\right]^{2}\right\} dt    (15.15)
Normal
Corrective maintenance times can also be represented by the normal distribution. The repair time probability density function of this distribution is defined by

f_rp(t) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}\left(\frac{t - \bar{t}}{\sigma}\right)^{2}\right]    (15.16)

where
\bar{t} is the mean of the maintenance times.
σ is the standard deviation of the variable maintenance time t around the mean \bar{t}.

By substituting Equation (15.16) into Equation (15.8), we get

M(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{t} \exp\!\left[-\frac{1}{2}\left(\frac{t - \bar{t}}{\sigma}\right)^{2}\right] dt    (15.17)
The mean of the maintenance times is given by

\bar{t} = \frac{\sum_{j=1}^{k} t_j}{k}    (15.18)

where
t_j is maintenance time j; for j = 1, 2, 3, ..., k.
k is the total number of maintenance times.

The standard deviation is

\sigma = \left[\frac{\sum_{j=1}^{k} (t_j - \bar{t})^2}{k - 1}\right]^{1/2}    (15.19)
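Evaluating Equations (15.15) and (15.17) in practice amounts to computing a standard normal cumulative probability after estimating the distribution parameters from repair data, as in Equations (15.13), (15.14), (15.18), and (15.19). The Python sketch below illustrates this for a small set of hypothetical repair times, assuming γ = 0 in the lognormal case.

```python
import math

def phi(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

repair_times = [2.1, 2.9, 3.4, 4.0, 5.6]  # hypothetical repair times (h)
k = len(repair_times)

# Normal case: Equations (15.18), (15.19), and (15.17)
mean_t = sum(repair_times) / k
sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in repair_times) / (k - 1))
print(phi((5.0 - mean_t) / sd_t))           # probability of repair within 5 h

# Lognormal case (gamma = 0): Equations (15.13), (15.14), and (15.15)
logs = [math.log(t) for t in repair_times]
theta = sum(logs) / k
sigma = math.sqrt(sum((x - theta) ** 2 for x in logs) / (k - 1))
print(phi((math.log(5.0) - theta) / sigma))
```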
MAINTAINABILITY DESIGN GUIDELINES AND COMMON DESIGN ERRORS
Over the years professionals involved with maintainability have developed many
guidelines that can be used to produce effective design with respect to maintainability. These guidelines include design for minimum maintenance skills, provide for
visual inspection, use standard interchangeable parts, design for minimum tools and
adjustments, avoid the use of large cable connectors, design for safety, provide test
points, use color-coding, label units, provide handles on heavy parts for the ease of
handling, use captive-type chassis fasteners, provide troubleshooting techniques,
group subsystems, and use plug-in rather than solder-in modules [21].
During the design phase of an equipment, there are many errors committed that
adversely affect its maintainability. Some of the common design errors are placing an adjustment out of arm's reach, omitting handles, using access doors with numerous small screws, placing removable parts such that they cannot be dismantled
without taking the entire unit from its case, locating adjustable screws close to a hot
component/an exposed power supply terminal, placing adjustable screws in locations
cumbersome for maintenance personnel to discover, providing unreliable built-in
test equipment, locating fragile parts just inside the bottom edge of the chassis where
the maintenance personnel are expected to place their hands, using chassis and cover
plates that drop when the last screw is taken out, placing low-reliability parts beneath
other parts, placing screwdriver adjustments beneath modules, and providing insufficient space for maintenance personnel to get their gloved hands into the unit to
perform required adjustment [21].
The RCM philosophy rests on the premise that the inherent reliability and safety characteristics of equipment cannot be improved through maintenance; good maintenance practices can only preserve such characteristics. The RCM philosophy calls for the performance of scheduled maintenance on critical components only under the following circumstances:
It will stop a decrease in reliability and/or degradation of safety to unacceptable levels, or
It will decrease the equipment life cycle cost.
In contrast, the RCM philosophy also calls for not performing scheduled maintenance on noncritical items unless it will lower the equipment life cycle cost.
logic elements in the fault tree. The sensitivities (sensitivity or conditional probability
may be defined as the probability that an occurrence of a basic fault event will lead
to a safety incident) are calculated by assigning a probability of unity to an elementary or basic fault event and then determining the resultant probability of a safety incident.
In turn, the criticality of each basic fault event is calculated by using these sensitivity
results.
Criticality may simply be described as a measure of the relative seriousness or
impact of each and every fault event on the undesired or top fault event. It (i.e.,
criticality) involves both qualitative and quantitative fault tree analyses and provides
a base mechanism to rank the fault events in their order of severity.
In quantitative terms, the criticality is expressed as follows [26]:

C = P(y) P(F|y)    (15.20)

where
P(y) is the probability of occurrence of the basic fault event, y.
P(F|y) is the sensitivity or conditional probability that an occurrence of a basic fault event will lead to a safety incident.
C is the criticality.
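As a simple illustration of Equation (15.20), the following Python sketch ranks a few hypothetical basic fault events by their criticality; the event names and probabilities are invented for the example.

```python
# Criticality ranking per Equation (15.20): C = P(y) * P(F|y)
events = {
    "seal leak":        (0.02, 0.40),   # (P(y), P(F|y)) -- hypothetical values
    "sensor drift":     (0.10, 0.05),
    "valve stuck open": (0.01, 0.90),
}

criticality = {name: p_y * p_f for name, (p_y, p_f) in events.items()}
for name, c in sorted(criticality.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: C = {c:.4f}")
```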
Use Decision Logic to Important Failure Modes
This step is concerned with the RCM decision logic, which is designed to arrive at the most effective preventive maintenance task combinations by asking standard assessment questions. Each critical failure mode highlighted by the FTA is reviewed by applying the decision logic for each maintenance-important item. Judgments are passed as to the need for various maintenance tasks as each failure mode is processed. The tasks considered necessary, along with the intervals established to be appropriate, form the overall scheduled preventive maintenance program.
The decision logic is composed of the following two levels:
Level 1. This is concerned with the evaluation of each failure mode for
determining consequence category: hidden safety, evident safety, operational economic, non-operational economic, or non-safety/economic. Four
questions associated with this level are as follows:
1. Does the failure or fault cause a direct adverse effect on operating
performance?
2. Does the failure/secondary failure lead to a safety incident?
3. Can operator(s) detect failures or faults when they occur?
4. Does the hidden failure itself, or the combination of a hidden failure
plus an additional failure of a system-related or back-up function, lead
to a safety incident?
Level 2. This is concerned with taking into consideration the failure causes
for each failure mode for choosing the specific type of task. Twenty-one
questions are asked in this level [23]. Some examples of those questions
are as follows:
Is an operator monitoring task effective and applicable?
Is a check for verifying operation/detecting impending failure effective
and applicable?
Is a servicing task effective and applicable?
Record Maintenance Categories
This step is concerned with applying the decision logic of the previous step to group maintenance requirements into three categories (i.e., hard-time maintenance requirements, on-condition maintenance requirements, and condition monitoring maintenance requirements) and defining a maintenance task profile. This task profile basically highlights the part number and failure mode, and the preventive-maintenance task selection for the RCM logic questions answered "yes". Subsequently, the maintenance-task profile is used for determining the preventive maintenance tasks applicable to each part under consideration.
Implement RCM Decisions
Subsequent to the establishment of the maintenance-task profile, the task frequencies/intervals are established and enacted as part of the overall maintenance plan.
Apply Sustaining Engineering Based on Real-life
Experience Information
As the RCM process has a life-cycle perspective, the main objective in this step is
to reduce scheduled maintenance burden and support cost while keeping the equipment/systems in a desirable operational-readiness state. After the system is operational and field data start to accumulate, one of the most pressing steps is to reassess
all RCM default decisions.
All in all, it must be remembered that the key objective is to eliminate all
excessive maintenance related costs while maintaining established and required
reliability and safety levels.
15.4 PROBLEMS
1. Define the term Reliability Centered Maintenance and discuss its historical developments.
2. Discuss the following terms:
Maintainability
Interchangeability
Standardization
3. Write an essay on design maintainability.
4. What are the important factors that affect the maintainability aspects of
an equipment design?
5. What are the issues addressed during the preliminary design review of equipment from the standpoint of maintainability?
6. An electronic system is composed of four subsystems 1, 2, 3, and 4. The
time to failure of each subsystem is exponentially distributed. The failure
rates of subsystems 1, 2, 3, and 4 are 1 = 0.01 failure/h, 2 = 0.03 failure/h,
3 = 0.04 failure/h, and 4 = 0.05 failure/h, respectively. In addition, the
corrective maintenance times associated with subsystems 1, 2, 3, and 4
are T1 = 2 h, T2 = 4 h, T3 = 1 h, and T4 = 3 h, respectively. Calculate the
electronic system mean time to repair.
7. Obtain a maintainability function when the probability density function
of the repair times is described by the Weibull distribution.
8. Compare traditional maintenance with RCM.
9. Describe the RCM methodology process.
10. What are the benefits of applying the RCM methodology?
15.5 REFERENCES
1. AMCP 706-133, 1976, Engineering Design Handbook: Maintainability Engineering
Theory and Practice, prepared by the Army Material Command, Department of the
Army, Washington, D.C.
2. Shooman, M.L., Probabilistic Reliability: An Engineering Approach, McGraw-Hill,
New York, 1968.
3. Moss, M.A., Minimal Maintenance Expense, Marcel Dekker, New York, 1985.
4. Retterer, B.L. and Kowalski, R.A., Maintainability: A historical perspective, IEEE
Trans. Reliability, 33, 56-61, 1984.
5. Ankenbrandt, F.L., Ed., Electronic Maintainability, Engineering Publishers, Elizabeth,
NJ, 1960.
6. Dhillon, B.S., Reliability Engineering in Systems Design and Operation, Van Nostrand Reinhold Company, New York, 1983.
7. Dhillon, B.S., Reliability and Quality Control: Bibliography on General and Specialized Areas, Beta Publishers, Gloucester, Canada, 1993.
8. MSG 1, Maintenance Evaluation and Program Development, 747 Maintenance Steering Group Handbook, Air Transport Association, Washington, D.C., July 1968.
9. MSG 2, Airline/Manufacturer Maintenance Program Planning Document, Air Transport Association, Washington, D.C., September 1980.
10. Moubray, J., Reliability Centered Maintenance, Industrial Press, New York, 1992.
11. MSG 3, Airline/Manufacturer Maintenance Program Planning Document, Air Transport Association, Washington, D.C., September 1980.
12. Anderson, R.T., Reliability Centered Maintenance: Management and Engineering
Methods, Elsevier Applied Science Publishers, London, 1990.
13. U.S. AMC Pamphlet 750-2, Guide to Reliability Centered Maintenance, U.S. Army,
Department of Defense, Washington, D.C., 1985.
14. MIL-STD-1843, Reliability Centered Maintenance for Aircraft, Engines, and Equipment, Department of Defense, Washington, D.C., 1985.
15. Hall, A.C., Relationship between operational effectiveness of electronic systems and
maintenance minimization and training, Proc. Symp. Electron. Mainten., Washington,
D.C., 1955.
16. AMC Memo AMCRD-ES, Subject: Economy vs. Life Cycle Cost, signed by Maj.
Gen. Guthrie, J.R., Director, Research, Development, and Engineering, Headquarters,
Washington, D.C., January 21, 1971.
17. Ankenbrandt, F.L., Ed., Electronic Maintainability, Vol. 3, Engineering Publishers,
Elizabeth, NJ, 1960.
18. AMCP-706-134, Maintainability Guide for Design, prepared by the Department of
the Army, Department of Defense, Washington, D.C., 1972.
19. Grant Ireson, W., Coombs, C.F., and Moss, R.Y., Handbook of Reliability Engineering
and Management, McGraw-Hill Companies, New York, 1996.
20. Dhillon, B.S., Engineering Design: A Modern Approach, Richard D. Irwin, Chicago,
IL, 1996.
21. Pecht, M., Ed., Product Reliability, Maintainability, and Supportability Handbook,
CRC Press, Boca Raton, FL, 1995.
22. Webster's Encyclopaedic Dictionary, Lexicon Publication, New York, 1988.
23. Brauer, D.C. and Brauer, G.D., Reliability centered maintenance, IEEE Trans. Reliability, 36, 17-24, 1987.
24. MIL-HDBK-217, Reliability Prediction of Electronic Equipment, Department of
Defense, Washington, D.C.
25. Dhillon, B.S., Human Reliability: With Human Factors, Pergamon Press, New York,
1986.
26. NUREG-0492, Fault Tree Handbook, prepared by the U.S. Nuclear Regulatory Commission, Washington, D.C., 1981.
16 Total Quality Management and Risk Assessment
16.1 INTRODUCTION
Total quality management (TQM) is an enhancement to the conventional approach
of conducting business and it may simply be stated as a philosophy of pursuing
continuous improvement in each and every process through the integrated efforts of
all concerned persons associated with the organization.
The roots of the total quality movement may be traced back to the early 1900s
in the works of Frederick W. Taylor, the father of scientific management, concerning
the time and motion studies [1, 2]. In 1924, to control product variables, Walter A.
Shewhart, working as a quality control inspector for Bell Telephone Laboratories,
developed a statistical chart. This development is probably considered the beginning of statistical quality control.
In the late 1940s, the efforts of quality gurus such as W.E. Deming, J. Juran,
and A.V. Feigenbaum played an instrumental role in the strengthening of the TQM
movement [3]. In 1950, Deming lectured on the principles of statistical quality
control to 230 Japanese engineers and scientists at the request of the Japanese Union
of Scientists and Engineers (JUSE) [4]. In turn, in 1951, JUSE established the
Deming prize to be awarded to a company demonstrating the most effective implementation of quality measures and policies [5]. However, the term Total Quality
Management was not coined until 1985 and, in fact, it is credited to an American
behavioral scientist, Nancy Warren [5].
After witnessing the success of the Deming prize in Japan, the U.S. government
established the Malcolm Baldrige Award in 1987 for companies demonstrating
effective implementation of quality assurance policies and measures. In 1988, the
Cellular Telephone Division of Motorola was the first recipient of the Malcolm
Baldrige Award for its achievements in reducing defects from 1060 per million to 100 per million
during the time frame from 1985 to 1988 [6, 7].
Since the late 1980s many other important events related to TQM have occurred.
Risk is present in all human activity and it can either be health and safety related
or economic. An example of the economic related risk is loss of equipment/production due to accidents involving fires, explosions, and so on. Nonetheless, risk may
simply be described as a measure of the probability and severity of a negative effect
to health, equipment/property, or the environment [8]. Insurance is one of the oldest
strategies for dealing with risks and its history can be traced back to about 4000 years
in Mesopotamia (modern Iraq) when the Code of Hammurabi, in 1950 BC, formalized
bottomry contracts containing a risk premium for the chance of losing ships and
their cargo [9]. In the same area (i.e., Tigris-Euphrates Valley) around 3200 BC a
group of people known as the Asipu served as risk analysis consultants to other
people involved in making risky, uncertain, or difficult decisions.
In the fourth century BC a Greek doctor, Hippocrates, correlated the occurrence of diseases with environmental exposures, and in the sixteenth century AD Agricola
identified the correlation between occupational exposure to mining and health [9].
The basis of modern quantitative risk analysis was developed by Pierre Laplace
in 1792 by calculating the probability of death with and without smallpox vaccination. In the twentieth century, the conceptual development of risk analysis is due to
factors such as the development of nuclear power plants and concerns about their
safety and the establishment of such U.S. bodies as the Environmental Protection Agency (EPA), the Occupational Safety and Health Administration (OSHA), and the National Institute for Occupational Safety and Health (NIOSH). A more detailed history of risk analysis
and risk management is provided in Reference 10.
Both TQM and risk assessment are described below.
16.2 TQM
A key to success in the corporate world has been the ability to improve on business
processes and operational tasks. Today, the playing field for company products has
changed forever because competition is no longer based solely on price, but rather
on a combination of price and quality. It means the combination must surpass that offered by competing products for ultimate success. Consequently, it may be said that companies must continually improve and upgrade their
capabilities in order to satisfy changing customer needs. Thus, quality goals and
measuring attainment of such goals is a dynamic process. In other words, achieving
quality is a journey without end.
A survey of 100 U.S. executives conducted in 1989 revealed that only 22% of
them stated that their company had done all it could to create a quality-fostering
environment [11, 12]. This was probably an important factor in the TQM movement picking up steam in the U.S. Nonetheless, according to Reference 11, the
rationale for embarking upon TQM varied among the U.S. companies and the
principal reasons included increased competition, customer demands, employees,
and internal crisis.
It is important to remember that many times there can be good scientific innovations, but in the absence of effective practice of a TQM methodology, organizations/nations may lose their competitive advantage. For example, during the period
between the mid-1950s to the mid-1980s, British researchers won 26 Nobel Prizes
for new discoveries in comparison to only 4 by the Japanese [12]. But during the
same time period, the Japanese industry successfully took a major share of the world
market.
FIGURE 16.1 The main areas of differences between the traditional quality assurance
program and TQM.
TABLE 16.1
Differences Between the Traditional Quality Assurance Approach (TQAA) and TQM

Area                     TQAA                                            TQM
Objective                Find errors                                     Prevent errors
Definition               Product driven                                  Customer driven
Cost                     Improvements in quality increase cost           Improvements in quality lower cost and increase productivity
Customer                 Ambiguous understanding of customer needs       An effective approach defined to comprehend and meet customer requirements
Quality defined          Products meet specifications                    Products suitable for consumer use
Quality responsibility   Inspection center/quality control department    All employees in the organization involved
Decision making          Practiced top-down approach                     -
TQM METHODS AND ELEMENTS
TABLE 16.2
Selective TQM Methods Belonging to the Analytical Classification

Method No.    Method name
1             Taguchi methods
2             Cause and effect analysis
3             Force field analysis
4             Domainal mapping
5             Solution effect analysis
6             Failure mode and effect analysis
7             Tolerance design
8             Fault tree analysis
9             Minute analysis
10            Paired comparisons
The primary role of the TQM methods in problem-solving for improving quality is
to be effective in satisfying customer needs. Furthermore, the methods can also be
useful to identify possible root causes and their potential solutions, in addition to
utilizing data/information to choose the most suitable alternatives for managing
quality.
There are a large number of methods that can be used in TQM work. In fact,
100 such methods are described in Reference 20. This reference classifies the TQM
methods into four distinct categories:
Analytical
Management
Idea generation
Data collection, analysis, and display
Tables 16.2 through 16.5 present selective methods belonging to each of the above
four categories, respectively. Some of these TQM methods are described below
[20-22].
Pareto Analysis
This is performed to separate the most important causes of a problem from the trivial ones. It is named after an Italian economist, Vilfredo Pareto (1848-1923), and its popular use in quality work is due to J.M. Juran, who emphasized that 80% of quality problems are the result of only 20% of the causes. Thus, Pareto analysis is an extremely useful TQM method for highlighting areas needing a concerted effort.
Sometimes Pareto analysis is also called the 80/20 rule. Nonetheless, the analysis is performed by collecting data on the causes of the problem, ranking the causes in decreasing order of occurrence, computing their cumulative percentages, and concentrating improvement effort on the vital few causes at the top of the ranking [20].
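A minimal Python sketch of the idea is shown below: causes are ranked by their frequency of occurrence, cumulative percentages are computed, and the causes needed to reach roughly 80% of all occurrences are flagged as the vital few. The cause names and counts are hypothetical.

```python
# Illustrative Pareto analysis of hypothetical defect data
causes = {"solder bridges": 120, "missing parts": 45, "wrong polarity": 20,
          "cracked housings": 10, "scratches": 5}

total = sum(causes.values())
running = 0
for cause, count in sorted(causes.items(), key=lambda c: c[1], reverse=True):
    vital = running < 0.80 * total      # cause is needed to reach the 80% mark
    running += count
    share = 100.0 * running / total
    label = "vital few" if vital else "trivial many"
    print(f"{cause:16s} {count:4d}  cumulative {share:5.1f}%  ({label})")
```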
TABLE 16.3
Selective TQM Methods Belonging to the Management Classification

Method No.    Method name
1             Kaizen
2             Pareto analysis
3             Quality circles
4             Quality function deployment (QFD)
5             Potential problem analysis
6             Deming wheel (PDCA)
7             Error proofing (pokayoke)
8             Benchmarking
9             Arrow diagram
10            Mystery shopping
TABLE 16.4
Selective TQM Methods Belonging to the Idea Generation Classification

Method No.    Method name
1             Brainstorming
2             Morphological forced connections
3             Opportunity analysis
4             Mind mapping
5             Nominal group technique
6             Imagineering
7             Snowballing
8             Buzz groups
9             List reduction
10            Multi-voting
TABLE 16.5
Selective TQM Methods Belonging to the Data Collection, Analysis, and Display Classification

Method No.    Method name
1             Histograms
2             Process analysis
3             Box and whisker plots
4             Checksheets
5             Scatter diagrams
6             Hoshin Kanri (quality policy deployment)
7             Flowcharts
8             Spider web diagrams
9             Statistical process control (SPC)
10            Dot plots
Cause-and-Effect Diagram
This method is often called the fishbone diagram because of its appearance: the box at the extreme right (the fish head) represents the effect or problem, and the left side represents all the possible causes, which are connected to the central line known as the fish spine.
In particular, the cause-and-effect diagram with respect to TQM can be described
as follows: the customer satisfaction could be the effect, and manpower, materials,
methods, and machinery are the major causes. The following steps are useful in
developing a cause-and-effect diagram:
Develop problem statement.
Brainstorm to highlight all possible causes.
Establish main cause classifications by stratifying into natural groupings
and process steps.
Develop the diagram by connecting the causes by following essential
process steps and fill in the problem or the effect in the diagram box on
the extreme right.
Refine cause classifications through questions such as the following:
What causes this?
Why does this condition exist?
There are many advantages of the cause-and-effect diagram: it is useful for highlighting root causes, a useful tool for generating ideas, an effective mechanism for presenting an orderly arrangement of theories, and a useful vehicle for guiding further inquiry.
Quality Function Deployment (QFD)
This is a method used for optimizing the process of developing and manufacturing
new products as per customer need. The method was developed by the Japanese in
1972 [22, 25] to translate customer needs into appropriate technical requirements.
The approach can be applied in areas such as research, product development, engineering, manufacturing, and marketing. The technique makes use of a set of matrices
to relate customer needs to counterpart characteristics expressed as technical specifications and process control requirements. The important QFD planning documents
include customer needs planning matrix, process plan and quality control charts,
operating instructions, and product characteristic deployments matrix.
The customer need planning matrix is used to translate the consumer requirements into product counterpart characteristics and the purpose of the process plan
and quality control charts is to identify important process and product parameters
along with control points. The operating instructions are useful to identify operations
that must be accomplished to achieve critical parameters and on the other hand the
product characteristic deployment matrix is used to translate final product counterpart characteristics into critical component characteristics.
A QFD matrix is frequently called the House of Quality because of its resemblance to a house. The following steps are necessary to build the house of quality
[22, 25]:
Identify the needs of customers.
Identify the essential product/process characteristics that will meet the
customer requirements.
Establish the relationship between the customer needs and the counterpart
characteristics.
Conduct evaluation analysis of competing products.
Develop counterpart characteristics of competing products and formulate
appropriate goals.
Identify counterpart characteristics to be utilized in the remaining process.
One clear-cut advantage of the QFD is that it encourages companies to focus on the process itself rather than focusing on the product or service. Furthermore, as correlations are developed between what is required and how it is to be achieved, the important areas become more apparent, which helps in making decisions.
One important limitation associated with the application of QFD is that the exact
needs must be identified in complete detail.
DESIGN FOR QUALITY AND GOALS FOR TQM
It may be said that quality starts with the product design specification writing phase
and continues to its operational phase. Nonetheless, there are a number of product
design elements that can adversely affect quality. Some of these elements are poor
part tolerances, lack of design refinement resulting in the use of more parts than
required to carry out the desired functions, difficulty in fabricating parts because of
poor design features, and the use of fragile parts.
There are various measures that can be useful to improve quality during the
product design phase including eliminating the need for adjustments, using repeatable
and well-understood processes, using parts that can easily withstand process operations, reducing parts and part numbers, eliminating engineering changes on released
products, simplifying assembly and making it foolproof, designing for robustness
using Taguchi methods, and designing for efficient and satisfactory testing [26].
There are a number of goals that must be satisfied in order to achieve the effectiveness of the TQM process.
16.3 RISK ASSESSMENT
Risk can only be managed effectively after it has been properly analyzed and understood. Risk analysis serves as a useful tool in identifying health and safety problems and approaches to uncover their solutions, in satisfying regulatory requirements, and in facilitating objective decisions on risk acceptability.
A multi-disciplinary approach is often required to conduct risk analysis, and it may call for sufficient knowledge in areas such as probability and statistics, engineering (electrical, mechanical, chemical, or nuclear), systems analysis, health sciences, social sciences, and physical, chemical, or biological sciences.
RISK ANALYSIS IN HAZARDOUS SYSTEMS
The life cycle of hazardous systems may be divided into three major phases as shown
in Figure 16.3.
Establishing the scope definition is the first step of the risk analysis process; the risk analysis scope is defined and documented at the beginning of the project, after a thorough understanding of the system under consideration has been gained. Nonetheless, the following five basic steps are involved in defining the risk analysis scope:
1. Describe the problems leading to risk analysis and then formulate risk
analysis objectives on the basis of major highlighted concerns.
2. Define the system under consideration by including factors such as system
general description, environment definition, and physical and functional
boundaries definition.
3. Describe the risk analysis associated assumptions and constraints.
4. Highlight the decisions to be made.
5. Document the total plan.
Identifying hazards is the second step of the risk analysis process and is basically
concerned with the identification of hazards that will lead to risk in the system. This
step also calls for the preliminary evaluation of the significance of the identified
hazardous sources. The main purpose of this evaluation is to determine the appropriate course of action.
Estimating risk is the third step of the risk analysis process and risk estimation
is conducted in the following steps:
Investigate sources of hazards to determine the probability/likelihood of
occurrence of the originating hazard and associated consequences.
Conduct pathway analysis to determine the mechanisms and likelihood
through which the receptor under consideration is influenced.
Choose risk estimation methods/approaches to be used.
Identify data needs.
Discuss assumptions/rationales associated with methods/approaches/data
being utilized.
Estimate risk to evaluate the degree of influence on the receptor under
consideration.
Document the risk estimation study.
Documenting is the fourth step of the risk analysis process and is basically concerned
with effectively documenting the risk analysis plan, the preliminary evaluation, and
the risk estimation. The documentation report should contain sections such as title,
abstract, conclusions, table of contents, objectives/scope, assumptions/limitations,
system description, analysis methodology description, results of hazard identification, model description and associated assumptions, quantitative data and associated
assumptions, results of risk estimation, references, appendices, discussion of results,
and sensitivity analysis.
The fifth step of the risk analysis process is basically concerned with verifying
the end results. More specifically, it may be stated that verification is a review process
used to determine the integrity and accuracy of the risk analysis process. Verification
is conducted at appropriate times by person(s) other than the involved analyst(s).
The final step of risk analysis is concerned with periodically updating the
analysis as more up-to-date information becomes available.
RISK ANALYSIS METHODS FOR ENGINEERING SYSTEMS
Over the years many different methods to conduct risk analysis have been
developed [8, 30-32]. It is important to carefully consider the relevance and suitability of these techniques prior to their proposed applications. Some of the factors
to be considered specifically in this regard are appropriateness to the system under
study, scientific defensibility, format of the results with respect to improvement in
understanding of the risk occurrence and risk controllability, and simplicity.
The additional factors that should be used as the basis for selecting risk analysis
technique(s) by the analyst include study objectives, development phase, level of risk,
system type and hazard being analyzed, information and data needs, manpower requirement, level of expertise required, updating flexibility, and resource requirement.
The risk analysis methods used for engineering systems may be grouped into
two categories: (1) hazard identification and (2) risk estimation. Examples of the
techniques belonging to the hazard identification category are hazard and operability
study (HAZOP), event tree analysis (ETA), and failure modes and effect analysis
(FMEA). Similarly, the examples of the techniques belonging to the risk estimation
category include frequency analysis and consequence analysis.
Four of these techniques are discussed below.
Consequence Analysis
Consequence analysis is concerned with estimating the impact of the undesired event
on adjacent people, property, or the environment. Usually, for risk estimations
concerning safety, it consists of calculating the probability that people at different
distances (and in different environments) from the undesired event source will suffer
injury/illness. Some examples of the undesired event are explosions, projection of
debris, fires, and release of toxic materials. It means there is a definite need to use
consequence models for predicting the extent and probability of casualties due to
such undesired events. Nonetheless, it is appropriate that consequence analysis take into consideration factors such as: analysis based on the chosen undesirable events, measures to eradicate the consequences, explanation of any series of consequences resulting from the undesirable events, an outline of the criteria used for identifying the consequences, and immediate and aftermath consequences.
Hazard and Operability Study (HAZOP)
This is a form of FMEA and was developed for applications in chemical industries.
HAZOP is a systematic approach used to identify hazards and operational problems
throughout a facility. Over the years, it has proven to be an effective tool for identifying
unforeseen hazards designed into facilities due to various reasons, or introduced into
existing facilities due to factors such as changes made to process conditions or
operating procedures.
The approach has three primary objectives: develop a full facility/process description, systematically review each facility/process part to identify how deviations from the design intentions can occur, and pass judgment on whether such deviations can result in hazards/operating problems.
HAZOP can be used to analyze design at various stages as well as to perform
analysis of process plants in operation. However, the application of HAZOP in the
early design phase often leads to a safer detailed design. The following basic steps
are associated with HAZOP [8]:
Develop study objectives and scope.
Form HAZOP team by ensuring that its membership is comprised of
persons from design and operation.
Collect appropriate drawings, process description, and other relevant documentation; for example, layout drawings, process control logic diagrams,
operation and maintenance procedures, and process flowsheets.
Conduct analysis of all major pieces of equipment, supporting equipment,
etc.
Effectively document the study.
Frequency Analysis
This is basically concerned with estimating the occurrence frequency of each undesired event or accident scenario. Two commonly used approaches for performing
frequency analysis are as follows:
Making use of the frequency data of past relevant undesired events to
predict frequency of their future occurrences.
Using methods such as ETA and fault tree analysis (FTA) to calculate the
occurrence frequencies of undesired events.
All in all, each of the above two techniques has strengths where the other has
weaknesses; thus, each such approach should be used to serve as a check for the
other wherever it is feasible.
Event Tree Analysis (ETA)
This is a bottom-up approach used to identify the possible outcomes when the occurrence of an initiating event is known. The approach is useful for analyzing facilities having engineered accident-mitigating features, in order to identify the sequence of events that follows the initiating event and to generate the resulting event sequences. Usually, it is assumed that each sequence event is either a success or a failure. It is important to note that the ETA approach is often used to perform analysis of more complex systems than the ones handled by the FMEA approach [30, 33, 34].
Because of the inductive nature of ETA, the fundamental question asked is "What happens if ...?" ETA highlights the relationship between the success or failure of
various mitigating systems as well as the hazardous event that follows the initiating
event. Additional factors associated with ETA include
A comprehensive risk assessment requires identification of all potential
initiating events.
An excellent tool to identify events that require further investigation using
FTA.
It is rather difficult to incorporate delayed success or recovery events when
performing ETA.
ETA application always leaves room for missing some important initiating
events.
16.3.5 ADVANTAGES OF RISK ANALYSIS
There are many benefits to performing risk analysis [8].
16.4 PROBLEMS
1. Write an essay on TQM.
2. Define the following terms:
Risk
Risk management
Risk assessment
Risk control
3. Compare traditional quality assurance program with TQM.
4. Discuss Deming's 14 points concerning TQM.
5. List 10 most important methods of TQM. Provide a short discussion on
each of these methods.
6. What are Deming's deadly diseases of American management with
respect to quality?
7. What are the common errors made when starting quality initiatives?
8. Describe the risk analysis process.
9. What are the benefits of risk analysis?
10. Discuss the following two risk analysis methods:
HAZOP
ETA
16.5 REFERENCES
1. Goetsch, D.L. and Davis, S., Implementing Total Quality, Prentice-Hall, Englewood
Cliffs, NJ, 1995.
2. Rao, A., Carr, L.P., Dambolena, I., Kopp, R.J., Martin, J., Raffi, F., and Schlesinger,
P.F., Total Quality Management: A Cross Functional Perspective, John Wiley & Sons,
New York, 1996.
3. Gevirtz, C.D., Developing New Products with TQM, McGraw-Hill, New York, 1994.
4. Dobyns, L. and Crawford-Mason, C., Quality or Else, Houghton Mifflin, Boston, 1991.
5. Schmidt, W.H. and Finnigan, J.P., The Race without a Finish Line: America's Quest
for Total Quality, Jossey-Bass Publishers, San Francisco, CA, 1992.
6. Van Ham, K., Setting a total quality management strategy, in Global Perspectives on
Total Quality, The Conference Board, New York, 1991.
7. Madu, C.N. and Chu-hua, K., Strategic total quality management (STQM), in
Management of New Technologies for Global Competitiveness, Madu, C.N., Ed.,
Quorum Books, Westport, CT, 1993, pp. 3-25.
8. Risk Analysis Requirements and Guidelines, CAN/CSA-Q6340-91, prepared by the
Canadian Standards Association, 1991. Available from Canadian Standards Association, 178 Rexdale Boulevard, Rexdale, Ontario, Canada.
9. Molak, V., Ed., Fundamentals of Risk Analysis and Risk Management, CRC Press,
Boca Raton, FL, 1997.
10. Covello, V.T. and Mumpower, J., Risk analysis and risk management: A historical
perspective, Risk Analysis, 5, 103-120, 1985.
11. Farquhar, C.R. and Johnston, C.G., Total Quality Management: A Competitive Imperative Report No. 60-90-E, 1990, The Conference Board of Canada, Ottawa, Ontario,
Canada.
12. Caropreso, F., Making Total Quality Happen, The Conference Board, New York, 1990.
13. Spenley, P., World Class Performance Through Total Quality, Chapman and Hall,
London, 1992.
14. Burati, J.L., Matthews, M.F., and Kalidindi, S.N., Quality management organizations
and techniques, J. Construction Eng. Manage., 118, 112-128, 1992.
15. Matthews, M.F. and Burati, J.L., Quality Management Organizations and Techniques,
Source Document 51, The Construction Industry Institute, Austin, Texas, 1989.
16. Imai, M., Kaizen: The Key to Japan's Competitive Success, Random House, New
York, 1986.
17. Ishikawa, K., Guide to Quality Control, Asian Productivity Organization, Tokyo,
1982.
18. Kume, H., Statistical Methods for Quality Improvement, The Association for Overseas
Technology Scholarship, Tokyo, 1985.
19. Perisco, J., Team up for quality improvement, Quality Progress, 22, 33-37, 1989.
20. Kanji, G.K. and Asher, M., 100 Methods for Total Quality Management, Sage Publications Ltd., London, 1996.
21. Heizer, J. and Render, B., Production and Operations Management, Prentice-Hall,
Upper Saddle River, New Jersey, 1995.
22. Mears, P., Quality Improvement Tools and Techniques, McGraw-Hill, New York, 1995.
23. Uselac, S., Zen Leadership: The Human Side of Total Quality Team Management,
Mohican Publishing Company, Londonville, OH, 1993.
24. Akao, Y., Hoshin Kanri: Policy Deployment for Successful TQM, Productivity Press,
Cambridge, MA, 1991.
25. Yoji, K., Quality Function Deployment, Productivity Press, Cambridge, MA, 1991.
26. Daetz, D., The effect of product design on product quality and product cost, Quality
Progress, June 1987, pp. 63-67.
27. Coppola, A., Total quality management, in Tutorial Notes, Annu. Reliability Maintainability Symp., 1992, pp. 1-44.
28. Clemmer, J., Five common errors companies make starting quality initiatives, Total
Quality, 3, 4-7, 1992.
29. Kunreuther, H. and Slovic, P., Eds., Challenges in risk assessment and risk management, in Annu. Am. Acad. Political Soc. Sci., 545, 1-220, 1996.
30. Vesely, W.E., Engineering risk analysis, in Technological Risk Assessment, Ricci, P.F.,
Sagan, L.A., and Whipple, C.G., Eds., Martinus Nijhoff Publishers, The Hague, 1984,
pp. 49-84.
31. Covello, V. and Merkhofer, M., Risk Assessment and Risk Assessment Methods: The
State-of- the-Art, NSF report, 1984, National Science Foundation (NSF), Washington,
D.C.
32. Dhillon, B.S. and Rayapati, S.N., Chemical systems reliability: A survey, IEEE Trans.
Reliability, 37, 199-208, 1988.
33. Cox, S.J. and Tait, N.R.S., Reliability, Safety, Risk Management, Butterworth-Heinemann Ltd., Oxford, 1991.
34. Ramakumar, R., Engineering Reliability: Fundamentals and Applications, PrenticeHall, Englewood Cliffs, NJ, 1993.
17 Life Cycle Costing
17.1 INTRODUCTION
Today, in the global economy and under various other market pressures, the procurement decisions for many products are not made entirely on initial acquisition costs but on their total life cycle costs; this is particularly the case for expensive products.
Many studies performed over the years indicate that the product ownership costs
often exceed procurement costs. In fact, according to Reference 1, the product
ownership cost (i.e., logistics and operating cost) can vary from 10 to 100 times the
procurement cost. The high cost of ownership is also evident from the overall annual budgets of various organizations. For example, in
1974, the operation and maintenance cost accounted for 27% of the total U.S.
Department of Defense budget as opposed to 20% for procurement [2].
Life cycle cost of a product may simply be described as the sum of all costs
incurred during its life span, i.e., the total of procurement and ownership costs. The
history of life cycle costing goes back to the mid-1960s when Logistics Management
Institute prepared a document [3] entitled Life Cycle Costing in Equipment Procurement for the U.S. Assistant Secretary of Defense, Installations and Logistics.
Consequently, the Department of Defense released a series of three guidelines for
life cycle costing procurement [4-6].
In 1974, Florida became the first U.S. state to formally adopt the concept of life
cycle costing and in 1978, the U.S. Congress passed the National Energy Conservation Policy Act. The passage of this Act made it mandatory that every new federal
government building be life cycle cost effective. In 1981, Reference 7 presented a
list of publications on life cycle costing. Over the years many people have contributed
to the subject of life cycle costing and Reference 8 presents a comprehensive list of
publications on the subject. This chapter discusses different aspects of life cycle
costing.
Estimate goal
Time schedule
Required data
Involved individuals
Required analysis format
Required analysis detail
Treatment of uncertainties
Ground rules and assumptions
Analysis constraints
Analysis users
Limitations of funds
Required accuracy of the analysis
Nonetheless, the specific data required to conduct life cycle cost studies for an item
include useful life, acquisition cost, periodic maintenance cost, transportation and
installation costs, discount and escalation rates, salvage value/cost, taxes (i.e., investment tax credit, tax benefits from depreciation, etc.), periodic operating costs (e.g.,
energy cost, labor cost, insurance, cost of materials, and cost of supplies) [12].
Design engineers
Maintenance engineers
Tooling engineers
Reliability and maintainability engineers
Planning engineers
Test engineers
Quality control engineers
Manufacturing engineers
The documentation of the results of the life cycle cost analysis is as important as
the analysis itself. Poor documentation can dramatically decrease the effectiveness
of actual analysis. Needless to say, the writing of life cycle cost estimate reports
requires careful consideration. Even though such reports may vary from one
project/organization to another, they should include information on items such as
those described in Reference 8.
The future amount of a sum of money deposited at a compound interest rate is expressed by

FA = P (1 + i)^m    (17.1)

where
FA is the future amount.
P is the amount deposited (the principal).
i is the interest rate per interest period.
m is the number of interest periods.
Example 17.1
A person deposited $700 dollars in a bank at a compound interest rate of 5% per
year. Calculate its future amount after 7 years.
By inserting the given data into Equation (17.1) we get
FA = 700 (1 + 0.05)^7 ≅ $985
Similarly, the present value of an amount FA receivable after m interest periods is given by

P = FA (1 + i)^{-m}    (17.2)

where
P is the present value of FA.
Example 17.2
A company sold a personal computer and the buyer agreed to pay $5000 after 5 years.
The estimated interest rate is 6% per year. Calculate the present value of the amount
to be paid by the buyer.
Inserting the specified data into Equation (17.2) yields
P = 5,000 (1 + 0.06)^{-5} ≅ $3736
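Equations (17.1) and (17.2) are straightforward to program. The following Python sketch reproduces the results of Examples 17.1 and 17.2; the function names are illustrative.

```python
def future_amount(principal, i, m):
    # Equation (17.1): future amount of a deposit after m interest periods
    return principal * (1.0 + i) ** m

def present_value(future, i, m):
    # Equation (17.2): present value of an amount receivable after m periods
    return future * (1.0 + i) ** (-m)

print(round(future_amount(700, 0.05, 7)))   # Example 17.1: about $985
print(round(present_value(5000, 0.06, 5)))  # Example 17.2: about $3736
```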
The present value, PV, of a uniform series of payments X made at the end of each interest period is expressed by

PV = PV_1 + PV_2 + \cdots + PV_m = \frac{X}{(1+i)} + \frac{X}{(1+i)^2} + \frac{X}{(1+i)^3} + \cdots + \frac{X}{(1+i)^m}    (17.3)

where
PV_j is the present value of the payment X made at the end of interest period j; for j = 1, 2, ..., m.
m is the total number of interest periods.
In order to find the sum of the geometric series represented by Equation (17.3), we multiply both sides of this equation by 1/(1 + i) to get

\frac{PV}{(1+i)} = \frac{X}{(1+i)^2} + \frac{X}{(1+i)^3} + \frac{X}{(1+i)^4} + \cdots + \frac{X}{(1+i)^{m+1}}    (17.4)

Subtracting Equation (17.4) from Equation (17.3) yields

PV\left[1 - \frac{1}{(1+i)}\right] = X\left[\frac{1}{(1+i)} - \frac{1}{(1+i)^{m+1}}\right]    (17.5)

which, after simplification, gives

PV = X\left[\frac{1 - (1+i)^{-m}}{i}\right]    (17.6)
Example 17.3
A machine's expected useful life is 10 years and its use will generate an estimated
revenue of $20,000 at the end of each year. Calculate the present value of the total
revenue to be generated by the machine, if the annual interest rate is expected to be
5%.
Substituting the given data into Equation (17.6) yields
PV = 20,000 [1 - (1 + 0.05)^{-10}] / 0.05 ≅ $154,435
The total future amount, FA_t, of a uniform series of payments X made at the end of each interest period is expressed by

FA_t = X + X(1+i) + X(1+i)^2 + \cdots + X(1+i)^{m-1}    (17.7)

As Equation (17.7) is a geometric series, its sum can be found. Thus, we multiply both sides of Equation (17.7) by the factor (1 + i) to get

(1+i) FA_t = X(1+i) + X(1+i)^2 + \cdots + X(1+i)^m    (17.8)

Subtracting Equation (17.7) from Equation (17.8) yields

(1+i) FA_t - FA_t = X(1+i)^m - X    (17.9)

which simplifies to

FA_t = \frac{X[(1+i)^m - 1]}{i}    (17.10)
Example 17.4
In Example 17.3, instead of calculating the present value of the total income, find
the total future amount.
Substituting the given data into Equation (17.10) yields
FA_t = 20,000 [(1 + 0.05)^{10} - 1] / 0.05 ≅ $251,558
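The closed-form results of Equations (17.6) and (17.10) can likewise be checked with a few lines of Python; the sketch below reproduces Examples 17.3 and 17.4.

```python
def pv_uniform_payments(x, i, m):
    # Equation (17.6): present value of m equal end-of-period payments x
    return x * (1.0 - (1.0 + i) ** (-m)) / i

def fa_uniform_payments(x, i, m):
    # Equation (17.10): future amount of m equal end-of-period payments x
    return x * ((1.0 + i) ** m - 1.0) / i

print(round(pv_uniform_payments(20000, 0.05, 10)))  # Example 17.3: about $154,435
print(round(fa_uniform_payments(20000, 0.05, 10)))  # Example 17.4: about $251,558
```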
General Life Cycle Cost Model I
In this case, the equipment life cycle cost is expressed as the sum of the recurring cost (RC) and the nonrecurring cost (NRC):

LCC = RC + NRC    (17.11)
In turn, RC has five major components: maintenance cost, operating cost, support
cost, labor cost, and inventory cost.
Similarly, the elements of the NRC are procurement cost, training cost, LCC
management cost, support cost, transportation cost, research and development cost,
test equipment cost, equipment qualification approval cost, installation cost, and
reliability and maintainability improvement cost.
General Life Cycle Cost Model II
In this case, the equipment life cycle cost is categorized into four major classifications: research and development cost (RDC), production and construction cost
(PCC), operation and support cost (OSC), and disposal cost (DC). Thus, mathematically, the equipment LCC is given by
LCC = RDC + PCC + OSC + DC
(17.12)
The major elements of the RDC are engineering design cost, software cost, life
cycle management cost, product planning cost, test and evaluation cost, research
cost, and design documentation cost.
PCC is made up of five important elements: construction cost, manufacturing
cost, initial logistics support cost, industrial engineering and operations analysis
cost, and quality control cost.
The major components of the OSC are product distribution cost, product operation cost, and sustaining logistic support cost.
Finally, DC is expressed by
DC = URC + [CF (IDC - RV)]    (17.13)
(17.13)
where
CF is the condemnation factor.
URC is the ultimate retirement cost.
IDC is the item disposal cost.
RV is the reclamation value.
(17.14)
(17.15)
where
RPC
ALC
MTTR
RT
HC
MAC
LSC
(17.16)
where
MTBF is the mean time between failures in hours.
LT
is the life time of the item in hours.
The failure loss cost rate per hour, FCR, is defined by
FCR = AC MTBF
(17.17)
(17.18)
(17.19)
FC + ( RC + SC)
(17.20)
In turn, FC is given by
where
RC
SC
is
is
is
is
the
the
the
the
(17.21)
where
USC is the unit spare cost.
Q
is the fractional number of spares for each active unit.
Inserting Equations (17.20) and (17.21) into Equation (17.19) yields
LCC = IC + [RC + (USC) Q]
(17.22)
(17.23)
The predicted breakdown percentages of the LCC for AC, OC, LSC were 28%, 12%,
and 60%, respectively.
The four major components of the acquisition cost were fabrication cost, installation and checkout cost, design cost, and documentation cost. Their predicted
breakdown percentages were 20.16%, 3.92%, 3.36%, and 0.56%, respectively.
OC was made up of three principal elements (each elements predicted breakdown percentage is given in parentheses): cost of personnel (8.04%), cost of power
(3.84%), and cost of fuel (0.048%).
The logistic support cost was expressed by

LSC = RLC + ISC + AC + RMC + ITC + RSC    (17.24)

where
RLC is the repair labor cost.
ISC is the initial spares cost.
AC is the age cost.
RMC is the repair material cost.
ITC is the initial training cost.
RSC is the replacement spares cost.
The predicted breakdown percentages of the LCC for each of these six cost elements
were 38.64%, 3.25%, 1.186%, 5.52%, 0.365%, and 11.04%, respectively.
Appliance Life Cycle Cost Model
This model is proposed to estimate life cycle cost of major appliances [22]. In this
case, the life cycle cost of an appliance is expressed by
LCC = APC + \sum_{k=1}^{n} \frac{EC_k \, FC (1 + FER)^k}{(1 + i)^k}    (17.25)

where
APC is the appliance acquisition cost in dollars.
EC_k is the appliance energy consumption in year k, expressed in million BTUs.
n is the appliance useful life in years.
FC is the fuel cost in year one, expressed in constant dollars per million BTUs.
FER is the annual fuel escalation rate (%) in constant dollars.
i is the discount rate (%) in constant dollars.

If EC_k and FER are constant over the useful life span of the appliance, Equation (17.25) becomes

LCC = APC + (EC)(FC) \sum_{k=1}^{n} \left[\frac{1 + FER}{1 + i}\right]^k    (17.26)
The typical useful lives of appliances such as refrigerators, freezers, ranges and ovens, electric dryers, and room air conditioners are 15, 20, 14, 14, and 10 years, respectively.
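A compact way to apply Equation (17.26) is sketched below in Python; the appliance data shown (purchase price, energy use, fuel cost, escalation and discount rates) are hypothetical.

```python
def appliance_lcc(apc, ec, fc, fer, i, n):
    # Equation (17.26): appliance life cycle cost with constant annual
    # energy consumption ec (million BTU) and fuel escalation rate fer
    return apc + ec * fc * sum(((1.0 + fer) / (1.0 + i)) ** k
                               for k in range(1, n + 1))

# Hypothetical refrigerator: $900 price, 5 million BTU/year, $8 per million BTU
# in year one, 3% fuel escalation, 7% discount rate, 15-year useful life
print(round(appliance_lcc(apc=900, ec=5, fc=8.0, fer=0.03, i=0.07, n=15)))
```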
Example 17.5
A company using a machine to manufacture certain engineering parts is contemplating replacing it with a better one. Three different machines are being considered
for its replacement and their data are given in Table 17.1. Determine which of the
three machines should be purchased to replace the existing one with respect to their
life cycle costs.
TABLE 17.1
Data for Three Machines Under Consideration

No.   Description                      Machine A    Machine B    Machine C
1     Procurement price                $175,000     $160,000     $190,000
2     Expected useful life in years    12           12           12
3     Annual failure rate              0.02         0.03         0.01
4     Annual interest rate             5%           5%           5%
5     Annual operating cost            $5,000       $7,000       $3,000
6     Cost of a failure                $4,000       $3,000       $5,000
The annual failure costs of Machines A, B, and C are (0.02)($4,000) = $80, (0.03)($3,000) = $90, and (0.01)($5,000) = $50, respectively. Substituting the given data into Equation (17.6), the present value of Machine A failure cost over its useful life is

PVFCA = 80 [1 - (1 + 0.05)^{-12}] / 0.05 ≅ $709

where
PVFCA is the present value of Machine A failure cost over its life span.

Similarly, the present values of Machines B and C failure costs over their useful lives are

PVFCB = 90 [1 - (1 + 0.05)^{-12}] / 0.05 ≅ $798

and

PVFCC = 50 [1 - (1 + 0.05)^{-12}] / 0.05 ≅ $443
The present values of Machines A, B, and C operating costs over their useful life spans, using Equation (17.6) and the given data, are

PVOCA = 5,000 [1 - (1 + 0.05)^{-12}] / 0.05 ≅ $44,316

PVOCB = 7,000 [1 - (1 + 0.05)^{-12}] / 0.05 ≅ $62,043

and

PVOCC = 3,000 [1 - (1 + 0.05)^{-12}] / 0.05 ≅ $26,590
where
PVOCA is the present value of Machine A operating cost over its useful life.
PVOCB is the present value of Machine B operating cost over its useful life.
PVOCC is the present value of Machine C operating cost over its useful life.
The life cycle costs of Machines A, B, and C are

LCCA = 175,000 + 709 + 44,316 = $220,025

LCCB = 160,000 + 798 + 62,043 = $222,841

and

LCCC = 190,000 + 443 + 26,590 = $217,033
As the life cycle cost of Machine C is the lowest, it should be purchased.
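The comparison of Example 17.5 can be generalized into a small Python routine, sketched below; it combines the procurement price with the present values, per Equation (17.6), of the annual failure and operating costs.

```python
def machine_lcc(price, failure_rate, cost_per_failure, operating_cost, i, years):
    # Life cycle cost as used in Example 17.5
    pv_factor = (1.0 - (1.0 + i) ** (-years)) / i          # Equation (17.6)
    annual_failure_cost = failure_rate * cost_per_failure
    return price + (annual_failure_cost + operating_cost) * pv_factor

machines = {
    "A": machine_lcc(175000, 0.02, 4000, 5000, 0.05, 12),
    "B": machine_lcc(160000, 0.03, 3000, 7000, 0.05, 12),
    "C": machine_lcc(190000, 0.01, 5000, 3000, 0.05, 12),
}
for name, lcc in machines.items():
    print(f"Machine {name}: LCC = ${lcc:,.0f}")
print("Lowest life cycle cost:", min(machines, key=machines.get))
```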
The cost-capacity relationship is often used to obtain a quick estimate of the cost of a new facility or item of equipment of a different capacity from the known cost of a similar existing one. It is expressed by

C_n = C_od (KP_n / KP_od)^m    (17.27)

where
C_n is the cost of the new facility/equipment.
KP_n is the capacity of the new facility/equipment.
C_od is the cost of the old but similar facility/equipment.
KP_od is the capacity of the old facility/equipment.
m is the cost-capacity factor.
Example 17.6
An electric utility spent $1.5 billion to build a 3,000 megawatt nuclear power
generating station. In order to meet the growing demand for electricity, the company
plans to construct a 4,000 megawatt nuclear station. Estimate the cost of the new
station, if the value of the cost-capacity factor is 0.8.
Substituting the given data into Equation (17.27) yields
Cn = 1.5 (4,000/3,000)^0.8 = $1.89 billion
Thus, the new station will cost $1.89 billion to construct.
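A small Python helper makes it easy to repeat the Example 17.6 calculation for other capacities and cost-capacity factors; the function name is simply illustrative.

# Cost-capacity relationship, Equation (17.27): Cn = Cod * (KPn / KPod) ** alpha,
# where alpha is the cost-capacity factor.

def cost_capacity(c_old, cap_old, cap_new, alpha):
    """Estimate the cost of a new facility from a similar existing one."""
    return c_old * (cap_new / cap_old) ** alpha

# Example 17.6: $1.5 billion, 3,000 MW existing station; proposed 4,000 MW station.
c_new = cost_capacity(c_old=1.5, cap_old=3_000, cap_new=4_000, alpha=0.8)
print(f"Estimated cost of the new station: ${c_new:.2f} billion")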
The annual operating cost of an electric motor can be estimated by using the following equation:

Com = 0.746 (hp) Tom Ke / η    (17.28)

where
Com is the annual operating cost of the motor.
hp is the motor horsepower rating.
Tom is the annual motor operating time, expressed in hours.
Ke is the cost of electricity, expressed in dollars per kilowatt-hour.
η is the motor efficiency, expressed as a decimal fraction.
0.746 is the factor used to convert horsepower to kilowatts.
Example 17.7
A 40-hp AC motor is operated for 3500 h per year and the cost of electricity is
$0.08/kWh. Calculate the annual operating cost of the motor, if the motor efficiency
is 90%.
Substituting the given data into Equation (17.28) yields
Com = (0.746)(40)(3500)(0.08) / 0.90 ≈ $9,284
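The Example 17.7 calculation can likewise be scripted; in the sketch below the function and parameter names are illustrative, and 0.746 is the horsepower-to-kilowatt conversion factor.

# Annual operating cost of an electric motor, Equation (17.28):
#   Com = 0.746 * hp * Tom * Ke / efficiency
# 0.746 converts horsepower to kilowatts.

def motor_operating_cost(hp, hours_per_year, cost_per_kwh, efficiency):
    """Annual electricity cost of running a motor."""
    return 0.746 * hp * hours_per_year * cost_per_kwh / efficiency

# Example 17.7: 40-hp motor, 3500 h per year, $0.08/kWh, 90% efficiency.
c_om = motor_operating_cost(hp=40, hours_per_year=3500,
                            cost_per_kwh=0.08, efficiency=0.90)
print(f"Annual operating cost: ${c_om:,.0f}")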
The annual corrective maintenance labor cost of an item can be estimated by using the following relationship:

Ccm = Ta Cl (MTTR / MTBF)    (17.29)

where
Ccm is the annual corrective maintenance labor cost.
Ta is the scheduled annual operating time of the item, expressed in hours.
Cl is the corrective maintenance labor cost per hour.
MTTR is the mean time to repair, expressed in hours.
MTBF is the mean time between failures, expressed in hours.
Example 17.8
An electric motor is scheduled for 4500 h of operation in one year and its estimated
MTBF is 1500 h. Whenever it fails, it takes on the average 15 h to repair. Calculate
the annual cost for the motor corrective maintenance labor, if the maintenance labor
cost is $30 per hour.
Inserting the given data into Equation (17.29), we get

Ccm = (4,500)(30)(15/1500) = $1,350

Thus, the annual cost for the motor corrective maintenance labor is $1,350.
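A corresponding Python sketch for the corrective maintenance labor cost relationship of Equation (17.29), again with illustrative function and parameter names:

# Annual corrective maintenance labor cost, Equation (17.29):
#   Ccm = (scheduled annual operating hours) * (labor cost per hour) * (MTTR / MTBF)

def corrective_maintenance_labor_cost(operating_hours, labor_cost_per_hour, mttr, mtbf):
    """Expected annual labor cost of corrective (repair) maintenance."""
    return operating_hours * labor_cost_per_hour * (mttr / mtbf)

# Example 17.8: 4,500 h of operation per year, $30/h labor, MTTR = 15 h, MTBF = 1,500 h.
c_cm = corrective_maintenance_labor_cost(operating_hours=4500,
                                         labor_cost_per_hour=30,
                                         mttr=15, mtbf=1500)
print(f"Annual corrective maintenance labor cost: ${c_cm:,.0f}")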
17.11 PROBLEMS
TABLE 17.2
Data for Engineering Systems Offered by Two Different Manufacturers
No.  Description                      Manufacturer A's system    Manufacturer B's system
1    Purchasing price                 $80,000                    $60,000
2    Cost of failure                  $10,000                    $5,000
3    Annual interest rate             6%                         6%
4    Annual operating cost            $4,000                     $5,500
5    Annual failure rate              0.004                      0.004
6    Expected useful life in years    15                         15
17.12 REFERENCES
1. Ryan, W.J., Procurement views of life cycle costing, Proc. Annu. Symp. Reliability,
164-168, 1968.
2. Wienecke-Louis, E. and Feltus, E.E., Predictive Operations and Maintenance Cost
Model, Report No. ADA078052, 1979. Available from the National Technical Information Service (NTIS), Springfield, VA.
3. Life Cycle Costing in Equipment Procurement, Report No. LMI Task 4C-5, prepared
by Logistics Management Institute (LMI), Washington, D.C., April 1965.
4. Life Cycle Costing Procurement Guide (interim), Guide No. LCC1, Department of
Defense, Washington, D.C., July 1970.
5. Life Cycle Costing in Equipment Procurement-Casebook, Guide No. LCC2, Department of Defense, Washington, D.C., July 1970.
6. Life Cycle Costing for System Acquisitions (interim), Guide No. LCC3, Department
of Defense, Washington, D.C., January 1973.
7. Dhillon, B.S., Life cycle cost: A survey, Microelectronics and Reliability, 21, 495-511,
1981.
8. Dhillon, B.S., Life Cycle Cost: Techniques, Models, and Applications, Gordon and
Breach Science Publishers, New York, 1989.
32. Corripio, A.B., Chrien, K.S., and Evans, L.B., Estimate costs of centrifugal pumps
and electric motors, Chem. Eng., 89(4), 115, 1982.
33. Purohit, G.P., Cost of double pipe and multi-tube heat exchangers, Chem. Eng., 92,
96, 1985.
34. Woods, D.R., Anderson, S.J., and Norman, S.L., Evaluation of capital cost data: Heat
exchangers, Can. J. Chem. Eng., 54, 469, 1976.
35. Kumana, J.D., Cost update on specialty heat exchangers, Chem. Eng., 91(13), 164,
1984.
36. Hall, R.S., Mately, J., and McNaughton, K.J., Current costs of process equipment,
Chem. Eng., 89(7), 80, 1982.
37. Klumpar, I.V. and Slavsky, S.T., Updated cost factors: Process equipment, commodity
materials, and installation labor, Chem. Eng., 92(15), 73, 1985.
38. Humphreys, K.K. and Katell, S., Basic Cost Engineering, Marcel Dekker, New York,
1981.
39. Peters, M.S. and Timmerhaus, K.D., Plant Design and Economics for Chemical
Engineers, McGraw-Hill, New York, 1980.
40. Fang, C.S., The cost of shredding municipal solid waste, Chem. Eng., 87(7), 151,
1980.